Cleaning Noisy and Heterogeneous Metadata for Record Linking Across Scholarly Big Datasets
Pith reviewed 2026-05-25 19:25 UTC · model grok-4.3
The pith
A system combining metadata features and citation data matches noisy scholarly records with high accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Combining metadata features with citation information achieves high accuracy and significantly outperforms the baseline method on the same test dataset when matching scholarly document entities with noisy metadata against a reference dataset.
What carries the argument
Supervised classifier on features from all metadata fields plus citation information, after BM25 blocking on an ElasticSearch index.
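To make the blocking step concrete, here is a minimal sketch of BM25 candidate retrieval over a tiny in-memory list of titles. It is a stand-in for the ElasticSearch index, not the paper's implementation: the `BM25Index` class, the whitespace tokenizer, the Lucene-style IDF variant, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real analyzer does more.
    return text.lower().split()


class BM25Index:
    """Hypothetical in-memory stand-in for a BM25-scored search index."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_tf = [Counter(tokens) for tokens in self.doc_tokens]
        self.n = len(docs)
        self.avgdl = sum(len(t) for t in self.doc_tokens) / self.n
        # Document frequency: number of documents containing each term.
        self.df = Counter()
        for tf in self.doc_tf:
            self.df.update(tf.keys())

    def idf(self, term):
        n_t = self.df.get(term, 0)
        return math.log(1 + (self.n - n_t + 0.5) / (n_t + 0.5))

    def score(self, query, i):
        tf, dl = self.doc_tf[i], len(self.doc_tokens[i])
        total = 0.0
        for term in set(tokenize(query)):
            f = tf.get(term, 0)
            if f == 0:
                continue
            # Term-frequency saturation with document-length normalization.
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf(term) * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=3):
        # Keep only documents sharing at least one term with the query.
        scored = [(self.score(query, i), i) for i in range(self.n)]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for _, i in scored[:k]]


reference_titles = [
    "similarity estimation techniques from rounding algorithms",
    "entity matching across heterogeneous sources",
    "near duplicate detection in an academic digital library",
]
index = BM25Index(reference_titles)
# A noisy query title with a misspelled final word.
candidates = index.top_k("entity matching heterogeneous sourcs", k=2)
```

The blocking step only narrows the search: the candidates it returns would then be passed to the supervised classifier, which makes the actual match decision.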
If this is right
- Enables cleaning of CiteSeerX metadata through linkage to external datasets.
- Supports cross-linking of records across multiple large scholarly corpora.
- Improves entity resolution precision when titles alone are unreliable due to noise.
Where Pith is reading between the lines
- The method may apply to other noisy metadata domains if auxiliary link signals similar to citations exist.
- Deployment could indirectly improve search quality and citation network analysis in the target system.
- Further tests on datasets lacking citations would clarify the contribution of each component.
Load-bearing premise
Citation information is reliably available and accurate enough to be leveraged as a matching feature without introducing new errors.
What would settle it
A controlled test on records where citation data is deliberately removed or corrupted, checking whether accuracy falls below the metadata-only baseline.
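Such a test could be structured roughly as follows. This is only a sketch under strong assumptions: `toy_match` is a hypothetical rule-based matcher standing in for the trained classifier, the three labeled pairs are invented toy data, and the 0.5 thresholds and equal weights are arbitrary. It shows the shape of the harness, not any result from the paper.

```python
import random


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def toy_match(query, candidate, use_citations=True):
    # Hypothetical matcher: falls back to title similarity alone
    # when citations are disabled or missing from the query record.
    title_sim = jaccard(query["title"].lower().split(),
                        candidate["title"].lower().split())
    if not use_citations or not query["citations"]:
        return title_sim >= 0.5
    cite_sim = jaccard(query["citations"], candidate["citations"])
    return 0.5 * title_sim + 0.5 * cite_sim >= 0.5


def corrupt_citations(record, rate, rng):
    # Drop each citation independently with probability `rate`.
    kept = [c for c in record["citations"] if rng.random() >= rate]
    return {**record, "citations": kept}


def accuracy(pairs, **kwargs):
    correct = sum(toy_match(q, c, **kwargs) == label for q, c, label in pairs)
    return correct / len(pairs)


# Toy labeled pairs: (noisy query record, reference candidate, is_match).
PAIRS = [
    ({"title": "entity matchng acros heterogenous sources",
      "citations": ["c1", "c2", "c3"]},
     {"title": "entity matching across heterogeneous sources",
      "citations": ["c1", "c2", "c3"]}, True),
    ({"title": "near duplicate detection in an academic digital library",
      "citations": ["c4"]},
     {"title": "near duplicate detection in an academic digital library",
      "citations": ["c4", "c5"]}, True),
    ({"title": "a data cleaning method", "citations": ["c6", "c7"]},
     {"title": "a data cleaning method for citeseer dataset",
      "citations": ["c8", "c9"]}, False),
]

rng = random.Random(0)
intact = accuracy(PAIRS)
title_only = accuracy(PAIRS, use_citations=False)
stripped = accuracy([(corrupt_citations(q, 1.0, rng), c, y) for q, c, y in PAIRS])
```

Comparing `stripped` against `title_only` is the check described above; intermediate values of `rate` would trace how quickly accuracy degrades as citation data becomes less reliable.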
Original abstract
Automatically extracted metadata from scholarly documents in PDF format is usually noisy and heterogeneous, often containing incomplete fields and erroneous values. One common way of cleaning metadata is to use a bibliographic reference dataset. The challenge is to match records between corpora with high precision. The existing solution, which is based on information retrieval and string similarity on titles, works well only if the titles are clean. We introduce a system designed to match scholarly document entities with noisy metadata against a reference dataset. The blocking function uses the classic BM25 algorithm to find the matching candidates from the reference data that has been indexed by ElasticSearch. The core components use supervised methods which combine features extracted from all available metadata fields. The system also leverages available citation information to match entities. The combination of metadata and citation achieves high accuracy that significantly outperforms the baseline method on the same test dataset. We apply this system to match the database of CiteSeerX against Web of Science, PubMed, and DBLP. This method will be deployed in the CiteSeerX system to clean metadata and link records to other scholarly big datasets.
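As an illustration of what "features extracted from all available metadata fields" plus citation information might look like for one query/candidate pair, consider the sketch below. The abstract does not list the actual features, so everything here is assumed: the record layout (`title`, `authors`, `year`, `citations`), the "First Last" author format, and the five features returned by the hypothetical `pair_features` function.

```python
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def last_names(authors):
    # Assumption: names are written "First Last"; take the final token.
    return {a.split()[-1].lower() for a in authors if a.strip()}


def pair_features(query, candidate):
    """Return a feature vector for one (query, candidate) record pair."""
    title_sim = SequenceMatcher(
        None,
        normalize(query.get("title", "")),
        normalize(candidate.get("title", "")),
    ).ratio()
    author_sim = jaccard(last_names(query.get("authors", [])),
                         last_names(candidate.get("authors", [])))
    q_year = query.get("year")
    year_match = 1.0 if q_year is not None and q_year == candidate.get("year") else 0.0
    q_cites = query.get("citations") or []
    # Explicit flag so a classifier can tell "no overlap" from "no citations".
    has_citations = 1.0 if q_cites else 0.0
    cite_overlap = jaccard(q_cites, candidate.get("citations") or []) if q_cites else 0.0
    return [title_sim, author_sim, year_match, has_citations, cite_overlap]


reference = {
    "title": "Entity Matching Across Heterogeneous Sources",
    "authors": ["Yang Yang", "Jie Tang"],
    "year": 2015,
    "citations": ["r1", "r2"],
}
# A sparse extracted record: title only, with irregular spacing.
extracted = {"title": "entity  matching across heterogeneous sources"}
features = pair_features(extracted, reference)
```

Vectors like `features` would be the input to whatever supervised model is trained on labeled pairs. Missing fields map to zeros here, which is one simple design choice among several.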
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes a system for matching noisy, heterogeneous metadata records from CiteSeerX against reference corpora (Web of Science, PubMed, DBLP). It uses BM25 blocking via ElasticSearch to retrieve candidate matches, followed by a supervised classifier that combines features from all available metadata fields and, when present, citation information. The central claim is that the metadata-plus-citation combination achieves high accuracy that significantly outperforms a title-based baseline on the same test dataset; the method is intended for deployment inside CiteSeerX.
Significance. If the performance gains are shown to be robust and the citation-handling details are supplied, the work could supply a practical, deployable pipeline for large-scale scholarly record linking. The use of an off-the-shelf IR engine for blocking and the explicit incorporation of citation strings are pragmatic engineering choices that address real pain points in noisy PDF-extracted metadata.
major comments (3)
- [Abstract, §4] Abstract and §4 (system description): the headline claim that 'the combination of metadata and citation achieves high accuracy that significantly outperforms the baseline' is unsupported by any reported precision, recall, F1, accuracy, or dataset-size figures, nor by an explicit statement of the evaluation protocol or test-set construction. Without these numbers the central performance assertion cannot be assessed.
- [§4, §5] §4 (citation component) and §5 (experiments): no statistics are given on (a) the fraction of CiteSeerX records that possess usable citation strings, (b) the routing of records lacking citations (metadata-only path versus drop), or (c) an ablation that isolates the incremental contribution of the citation feature. The skeptic concern that reported gains may be an artifact of evaluating only on the citation-rich subset therefore remains unaddressed and is load-bearing for the 'significantly outperforms' claim.
- [§3, §4] §3 (blocking) and §4 (classifier): the manuscript supplies no description of the feature set, training procedure, hyper-parameter choices, or cross-validation protocol used by the supervised classifier, preventing verification that the reported superiority is not an artifact of over-fitting or an unrepresentative test split.
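The figures requested in these comments could be reported with something like the sketch below, which computes precision, recall, and F1 overall and separately for records with and without citation strings, along with the citation coverage fraction. The input layout (tuples of `has_citations`, true label, predicted label) and the function names are assumptions for illustration; the example rows are toy values, not results from the paper.

```python
def prf(y_true, y_pred):
    """Precision, recall, and F1 for boolean labels (match = True)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def report_by_citation_availability(rows):
    """rows: iterable of (has_citations, y_true, y_pred) tuples."""
    rows = list(rows)
    groups = {"all": rows,
              "with_citations": [r for r in rows if r[0]],
              "without_citations": [r for r in rows if not r[0]]}
    report = {name: prf([r[1] for r in g], [r[2] for r in g])
              for name, g in groups.items()}
    # Fraction of evaluated records that carry usable citation strings.
    report["coverage"] = len(groups["with_citations"]) / len(rows) if rows else 0.0
    return report


toy_rows = [
    (True, True, True),     # has citations, true match, predicted match
    (True, False, False),   # has citations, non-match, predicted non-match
    (False, True, False),   # no citations, true match missed
    (False, False, True),   # no citations, false positive
]
report = report_by_citation_availability(toy_rows)
```

Reporting the two strata side by side would show directly whether the headline gain holds for records that lack citations.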
minor comments (2)
- [Abstract, §1] The abstract and introduction repeatedly use the phrase 'high accuracy' without defining the metric or providing a numeric threshold; replace with concrete measures once the evaluation is added.
- [§4] Notation for the supervised model (features, label, loss) is introduced informally; a short table or equation block would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for quantitative results, citation statistics, and methodological details. We will revise the manuscript to incorporate all requested information and clarifications, strengthening the presentation of our record linkage pipeline.
- Referee: [Abstract, §4] Abstract and §4 (system description): the headline claim that 'the combination of metadata and citation achieves high accuracy that significantly outperforms the baseline' is unsupported by any reported precision, recall, F1, accuracy, or dataset-size figures, nor by an explicit statement of the evaluation protocol or test-set construction. Without these numbers the central performance assertion cannot be assessed.
  Authors: We agree the abstract and Section 4 lack the supporting numerical results and protocol details. In the revised version we will report precision, recall, F1, accuracy, and dataset sizes for the metadata-plus-citation method versus the title baseline, together with a clear description of the test-set construction and evaluation protocol used on the CiteSeerX matching tasks.
  Revision: yes
- Referee: [§4, §5] §4 (citation component) and §5 (experiments): no statistics are given on (a) the fraction of CiteSeerX records that possess usable citation strings, (b) the routing of records lacking citations (metadata-only path versus drop), or (c) an ablation that isolates the incremental contribution of the citation feature. The skeptic concern that reported gains may be an artifact of evaluating only on the citation-rich subset therefore remains unaddressed and is load-bearing for the 'significantly outperforms' claim.
  Authors: We accept that these statistics and the ablation are necessary. The revision will state the fraction of CiteSeerX records containing usable citation strings, describe the routing logic for records without citations (metadata-only path), and add an ablation comparing performance with and without the citation features to address the concern about subset bias.
  Revision: yes
- Referee: [§3, §4] §3 (blocking) and §4 (classifier): the manuscript supplies no description of the feature set, training procedure, hyper-parameter choices, or cross-validation protocol used by the supervised classifier, preventing verification that the reported superiority is not an artifact of over-fitting or an unrepresentative test split.
  Authors: We will expand the classifier description in the revised Section 4 to list all metadata and citation features, the supervised learning algorithm, hyper-parameter selection approach, and the cross-validation protocol, thereby allowing verification that the performance gains are not due to overfitting or split artifacts.
  Revision: yes
Circularity Check
No significant circularity; empirical systems evaluation on external data
The manuscript is a systems description of a record-linking pipeline (BM25 blocking via ElasticSearch followed by supervised classification on metadata plus citation features). It reports empirical accuracy on an external test dataset without any mathematical derivation, parameter fitting presented as prediction, or self-citation chain that reduces the central claim to its own inputs. The performance numbers are obtained by direct comparison against a held-out reference set and therefore remain falsifiable outside the paper's own construction.