An Information Extraction and Knowledge Graph Platform for Accelerating Biochemical Discoveries

Akihiro Fujita; Christoph Auer; Costas Bekas; Federico Zipoli; Hiroki Toda; Matteo Manica; Michele Dolfi; Peter Staar; Shuichi Hirose; Teodoro Laino

arxiv: 1907.08400 · v1 · pith:ARCRQ6E6new · submitted 2019-07-19 · 💻 cs.IR · cs.LG

Matteo Manica, Christoph Auer, Valery Weber, Federico Zipoli, Michele Dolfi, Peter Staar, Teodoro Laino, Costas Bekas, Akihiro Fujita, Hiroki Toda, Shuichi Hirose, Yasumitsu Orii

Pith reviewed 2026-05-24 19:30 UTC · model grok-4.3

classification 💻 cs.IR cs.LG

keywords: information extraction, knowledge graph, biochemical literature, PDF ingestion, carbohydrate enzymes, data integration, scalable system

The pith

A biochemistry knowledge graph built by ingesting databases and PDF publications enables queries for known facts and generation of novel insights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a scalable ingestion framework that pulls structured data from existing databases together with facts automatically extracted from PDF publications and assembles them into a single biochemistry knowledge graph. This graph is described as a comprehensive, queryable repository that supports both retrieval of established biochemical relationships and the production of new insights. The approach is demonstrated on carbohydrate enzymes, with the stated aim of lowering the time and cost of discovery work in domains such as food safety and pharmaceutics. A sympathetic reader would see the value in replacing manual literature review with an automated, integrated knowledge source.

Core claim

The BCKG is a comprehensive source of knowledge that can be queried to retrieve known biochemical facts and to generate novel insights. The system integrates data from databases and publications in PDF format through a scalable document ingestion framework and is illustrated by an application in the field of carbohydrate enzymes.

What carries the argument

The biochemistry knowledge graph (BCKG), which integrates extracted facts from PDFs and databases into a single queryable structure that supports both fact retrieval and insight generation.
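To make the idea of a single queryable structure concrete, here is a minimal sketch of a triple store that keeps a provenance tag on every fact. It is not the BCKG implementation, which the paper does not specify at this level; the `Fact` and `TinyGraph` names, the `source` field, and the toy facts are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    source: str  # hypothetical provenance tag, e.g. "database" or "pdf"


class TinyGraph:
    """Toy triple store for illustration; not the BCKG implementation."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def add(self, fact):
        self._by_subject[fact.subject].add(fact)

    def query(self, subject, predicate=None):
        """Return the distinct objects linked to `subject`, optionally by predicate."""
        facts = self._by_subject.get(subject, ())
        return sorted(
            {f.obj for f in facts if predicate is None or f.predicate == predicate}
        )

    def sources(self, subject, predicate, obj):
        """Return the provenance tags that support one triple."""
        facts = self._by_subject.get(subject, ())
        return sorted(
            {f.source for f in facts if f.predicate == predicate and f.obj == obj}
        )


# Illustrative toy facts, not data taken from the paper.
graph = TinyGraph()
graph.add(Fact("trehalase", "hydrolyzes", "trehalose", "database"))
graph.add(Fact("trehalase", "hydrolyzes", "trehalose", "pdf"))
graph.add(Fact("trehalose", "is_a", "disaccharide", "database"))

print(graph.query("trehalase", "hydrolyzes"))
print(graph.sources("trehalase", "hydrolyzes", "trehalose"))
```

Keeping the source on each fact means a query result can be traced back to whether it came from a curated database, an extracted PDF statement, or both, which matters for the reliability questions raised later on this page.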

If this is right

Queries on the BCKG retrieve known biochemical facts at scale.
Novel insights can be generated by traversing relationships stored in the integrated graph.
Knowledge ingestion scales to large volumes of biochemical publications without proportional manual effort.
The same ingestion pipeline reduces time to solution in application areas such as food safety and pharmaceutics.
The carbohydrate-enzyme demonstration shows the graph can be applied to a concrete biochemical subdomain.
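One simple reading of "generating insights by traversing relationships" is to look for pairs of nodes that are connected through an intermediate node but not linked directly. The sketch below does exactly that on a toy edge list. It is an assumption about what traversal could mean, not the paper's graph analytics, and the function name and node labels are hypothetical.

```python
from collections import defaultdict


def two_hop_candidates(edges):
    """Return sorted (start, middle, end) paths where start has no direct edge to end."""
    neighbours = defaultdict(set)
    for head, tail in edges:
        neighbours[head].add(tail)

    candidates = []
    for start, middles in neighbours.items():
        for middle in middles:
            # .get avoids inserting new keys while iterating over the dict.
            for end in neighbours.get(middle, ()):
                if end != start and end not in middles:
                    candidates.append((start, middle, end))
    return sorted(candidates)


# Placeholder node names; directed edges read as "head is linked to tail".
edges = [
    ("enzyme_A", "substrate_X"),
    ("substrate_X", "pathway_P"),
    ("enzyme_B", "pathway_P"),
]
print(two_hop_candidates(edges))
```

Each returned path is only a candidate relationship to be checked by a domain expert; whether it counts as an insight depends on the accuracy of the two edges it was built from.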

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same ingestion approach could be applied to literature in adjacent domains such as synthetic biology or toxicology.
Periodic re-ingestion of new PDFs would be required to keep the graph current with the growing literature.
Downstream machine-learning models trained on the graph could generate testable hypotheses that go beyond explicit retrieval.

Load-bearing premise

Automated information extraction from PDF publications produces sufficiently accurate and complete biochemical facts to support reliable queries and novel insights without substantial human correction.

What would settle it

The claim that the graph reliably supports queries and novel insights would be falsified by a direct comparison showing that a non-trivial fraction of the facts returned by BCKG queries are missing or incorrect when checked against primary literature or expert curation.
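A check of this kind could be run on a random sample of returned facts, with an interval estimate of the error rate rather than a single number. The sketch below uses the Wilson score interval for a binomial proportion; the function name and the sample counts are hypothetical and not taken from the paper.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error proportion (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical audit: 12 of 100 sampled facts judged incorrect by a curator.
low, high = wilson_interval(errors=12, n=100)
print(f"estimated error rate between {low:.3f} and {high:.3f}")
```

What counts as a "non-trivial fraction" still has to be fixed before the audit, for instance by stating the largest error rate at which the graph would still be considered reliable.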

Figures

Figures reproduced from arXiv: 1907.08400 by Akihiro Fujita, Christoph Auer, Costas Bekas, Federico Zipoli, Hiroki Toda, Matteo Manica, Michele Dolfi, Peter Staar, Shuichi Hirose, Teodoro Laino, Valery Weber, Yasumitsu Orii.

**Figure 1.** BCKG concept and current structure. The platform ingests knowledge from different data sources and implements graph analytics techniques in a comprehensive and queryable knowledge base (left). The currently assembled KG integrates multiple data sources, organizing them in linked collections of nodes (right). [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]

Original abstract

Information extraction and data mining in biochemical literature is a daunting task that demands resource-intensive computation and appropriate means to scale knowledge ingestion. Being able to leverage this immense source of technical information helps to drastically reduce costs and time to solution in multiple application fields from food safety to pharmaceutics. We present a scalable document ingestion system that integrates data from databases and publications (in PDF format) in a biochemistry knowledge graph (BCKG). The BCKG is a comprehensive source of knowledge that can be queried to retrieve known biochemical facts and to generate novel insights. After describing the knowledge ingestion framework, we showcase an application of our system in the field of carbohydrate enzymes. The BCKG represents a way to scale knowledge ingestion and automatically exploit prior knowledge to accelerate discovery in biochemical sciences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript describes a scalable document ingestion framework that extracts information from biochemical PDF publications and integrates it with database data to construct a biochemistry knowledge graph (BCKG). It claims the resulting BCKG is a comprehensive, queryable source of known facts that can also generate novel insights, and illustrates the approach via a carbohydrate-enzyme application.

Significance. If the automated extraction pipeline were shown to produce sufficiently accurate and complete triples, the platform could meaningfully accelerate biochemical research by enabling structured querying over literature-scale knowledge. The work targets a genuine scalability bottleneck in the domain.

major comments (2)

[Abstract] The central claim that the BCKG 'is a comprehensive source of knowledge' that 'can be queried to retrieve known biochemical facts and to generate novel insights' is unsupported because the manuscript supplies no precision, recall, or other quantitative accuracy metrics for the PDF information-extraction pipeline.
[Application section] In the carbohydrate-enzyme showcase, the description of the BCKG usage contains no held-out validation, inter-annotator agreement, or comparison against manually curated gold-standard triples, leaving the reliability of the asserted queries and insights untested.
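The metrics requested in these comments are straightforward to compute once a gold-standard set of triples exists. The following is a minimal sketch, assuming triples are compared by exact match on (subject, predicate, object); the function name and the placeholder triples are illustrative and not drawn from the manuscript.

```python
def precision_recall(extracted, gold):
    """Exact-match precision and recall of extracted triples against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Placeholder triples standing in for curated and automatically extracted facts.
gold = {
    ("e1", "acts_on", "s1"),
    ("e1", "acts_on", "s2"),
    ("e2", "acts_on", "s3"),
    ("e3", "acts_on", "s4"),
    ("e4", "acts_on", "s5"),
}
extracted = {
    ("e1", "acts_on", "s1"),
    ("e1", "acts_on", "s2"),
    ("e2", "acts_on", "s3"),
    ("e9", "acts_on", "s9"),  # spurious extraction
}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is the strictest choice; a real evaluation would likely need entity normalization (synonyms, identifiers) before comparison, which is outside this sketch.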

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract] The central claim that the BCKG 'is a comprehensive source of knowledge' that 'can be queried to retrieve known biochemical facts and to generate novel insights' is unsupported because the manuscript supplies no precision, recall, or other quantitative accuracy metrics for the PDF information-extraction pipeline.

Authors: We agree that the abstract advances strong claims about the BCKG without supporting quantitative metrics for extraction accuracy. The manuscript centers on the design of a scalable ingestion and integration framework rather than a comprehensive accuracy evaluation. In the revised manuscript we will either add a limited evaluation (precision/recall on a manually inspected sample of triples) or moderate the abstract language to describe the BCKG as an extensible platform whose completeness depends on the quality of its sources. revision: yes

Referee: [Application section] In the carbohydrate-enzyme showcase, the description of the BCKG usage contains no held-out validation, inter-annotator agreement, or comparison against manually curated gold-standard triples, leaving the reliability of the asserted queries and insights untested.

Authors: The carbohydrate-enzyme section is presented as an illustrative use case rather than a validated benchmark. We acknowledge that the absence of held-out validation or gold-standard comparison leaves the reliability of the demonstrated queries untested. In revision we will add an explicit limitations paragraph and, where data permit, include spot-checks against database entries or a small manually verified set to illustrate consistency. revision: yes
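If a small manually verified set is built by more than one curator, the inter-annotator agreement raised in this exchange can be reported with Cohen's kappa. The sketch below assumes two annotators each label the same sampled triples as correct (1) or incorrect (0); the function name and label vectors are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments on eight sampled triples (1 = correct, 0 = incorrect).
annotator_1 = [1, 1, 1, 0, 1, 0, 1, 1]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than percent agreement when most triples are labelled correct.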

Circularity Check

0 steps flagged

No circularity: systems paper with no derivations or predictions

The manuscript describes a document ingestion pipeline that populates a biochemistry knowledge graph (BCKG) from PDFs and databases and demonstrates its use on carbohydrate enzymes. No equations, fitted parameters, predictions, or uniqueness theorems appear; the central claims are architectural and descriptive rather than derived. Consequently no step reduces by construction to its own inputs, no self-citation chain is load-bearing for a result, and the paper is self-contained against external benchmarks. This matches the expected finding for a non-mathematical systems contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an engineering description of a document-ingestion pipeline; it introduces no mathematical free parameters, axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5699 in / 969 out tokens · 24605 ms · 2026-05-24T19:30:08.981512+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1] Dennis A. Benson, Mark Cavanaugh, Karen Clark, et al. 2017. GenBank. Nucleic Acids Research 45, D1 (jan 2017), D37–D42. https://doi.org/10.1093/nar/gkw1070

[2] Helen M. Berman, Tammy Battistuz, T. N. Bhat, et al. 2002. The protein data bank. Acta Crystallographica Section D: Biological Crystallography 58, 6 I (jan 2002), 899–907. https://doi.org/10.1107/S0907444902003451

[3] Brandi I. Cantarel, Pedro M. Coutinho, Corinne Rancurel, et al. 2009. The Carbohydrate-Active EnZymes database (CAZy): An expert resource for glycogenomics. Nucleic Acids Research 37, SUPPL. 1 (jan 2009), D233–8. https://doi.org/10.1093/nar/gkn663

[4] Sara El-Gebali, Jaina Mistry, Alex Bateman, et al. 2019. The Pfam protein families database in 2019. Nucleic Acids Research 47, D1 (jan 2019), D427–D432. https://doi.org/10.1093/nar/gky995

[5] Scott Federhen. 2012. The NCBI Taxonomy database. Nucleic Acids Research 40, D1 (jan 2012), D136–43. https://doi.org/10.1093/nar/gkr1178

[6] Anna Gaulton, Anne Hersey, Michał Nowotka, et al. 2017. The ChEMBL database in 2017. Nucleic Acids Research 45, D1 (2017), D945–D954. https://doi.org/10.1093/nar/gkw1074

[7] Takanobu Higashiyama. 2002. Novel functions and applications of trehalose. Pure and Applied Chemistry 74, 7 (jan 2002), 1263–1269. https://doi.org/10.1351/pac200274071263

[8] Lisa Jeske, Sandra Placzek, Ida Schomburg, et al. 2019. BRENDA in 2019: A European ELIXIR core data resource. Nucleic Acids Research 47, D1 (jan 2019), D542–D549. https://doi.org/10.1093/nar/gky1048

[9] Minoru Kanehisa, Miho Furumichi, Mao Tanabe, et al. 2017. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research 45, D1 (jan 2017), D353–D361. https://doi.org/10.1093/nar/gkw1092

[10] Sunghwan Kim, Jie Chen, Tiejun Cheng, et al. 2019. PubChem 2019 update: Improved access to chemical data. Nucleic Acids Research 47, D1 (jan 2019), D1102–D1109. https://doi.org/10.1093/nar/gky1033

[11] Hiroyuki Ogata, Susumu Goto, Kazushige Sato, et al. 1999. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 27, 1 (jan 1999), 29–34. https://doi.org/10.1093/nar/27.1.29

[12] Eric W. Sayers, Richa Agarwala, Evan E. Bolton, et al. 2019. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 47, D1 (jan 2019), D23–D28. https://doi.org/10.1093/nar/gky1069

[13] Peter W J Staar, Michele Dolfi, Christoph Auer, et al. 2018. Corpus Conversion Service. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD ’18. ACM Press, New York, New York, USA, 774–782. https://doi.org/10.1145/3219819.3219834

[14] Neil Swainston, Riza Batista-Navarro, Pablo Carbonell, et al. 2017. biochem4j: Integrated and extensible biochemical knowledge through graph databases. PLoS ONE 12, 7 (jul 2017), e0179130. https://doi.org/10.1371/journal.pone.0179130

[15] The UniProt Consortium. 2018. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research 47, D1 (jan 2018), D506–D515. https://doi.org/10.1093/nar/gky1049

[16] Kevin J. Yarema. 2010. Handbook of Carbohydrate Engineering. Taylor & Francis. 904 pages. https://doi.org/10.1201/9781420027631
