Construction of a Battery Research Knowledge Graph using a Global Open Catalog
Pith reviewed 2026-05-10 00:42 UTC · model grok-4.3
The pith
A pipeline creates weighted research descriptor vectors for battery authors using OpenAlex and AI-extracted keyphrases to build an interoperable knowledge graph.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a pipeline that constructs an author-centric knowledge graph of battery research from OpenAlex by deriving, for each author, a weighted vector of research descriptors. The vector combines coarse OpenAlex concepts with fine-grained keyphrases extracted via KeyBERT with ChatGPT (gpt-3.5-turbo) as the backend, weighted by descriptor origin, authorship position, and recency. Applied to 189,581 works, the vectors enable author-author similarity computation, community detection, and browser-based exploratory search. The graph is serialized in RDF and linked to Wikidata identifiers for interoperability and extensibility.
What carries the argument
A weighted research-descriptor vector that combines OpenAlex concepts with KeyBERT/ChatGPT keyphrases, weighted by descriptor origin, authorship position, and recency.
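As a concrete illustration, the weighting scheme described above can be sketched in a few lines of Python. The specific weight values, the exponential recency decay, and the field names are assumptions for illustration only; the paper's actual parameters are not reproduced here.

```python
from collections import defaultdict

# Illustrative weights only -- the paper's actual values are not given here.
ORIGIN_W = {"concept": 0.5, "keyphrase": 1.0}     # fine-grained terms count more
POSITION_W = {"first": 1.0, "middle": 0.5, "last": 0.8}

def recency_weight(pub_year, current_year=2025, half_life=5.0):
    """Exponential decay so that recent publications dominate the profile."""
    return 0.5 ** ((current_year - pub_year) / half_life)

def author_vector(works):
    """works: list of dicts with 'descriptors' [(term, origin)], 'position', 'year'."""
    vec = defaultdict(float)
    for w in works:
        w_pub = POSITION_W[w["position"]] * recency_weight(w["year"])
        for term, origin in w["descriptors"]:
            vec[term] += ORIGIN_W[origin] * w_pub
    return dict(vec)

works = [
    {"descriptors": [("solid electrolyte", "keyphrase"),
                     ("Materials science", "concept")],
     "position": "first", "year": 2024},
    {"descriptors": [("solid electrolyte", "keyphrase")],
     "position": "middle", "year": 2015},
]
vec = author_vector(works)   # recent, first-authored keyphrases weigh most
```

Under this toy parameterization, a keyphrase from a recent first-authored paper contributes far more to the profile than a coarse concept or an old middle-authored work.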
If this is right
- Author-author similarity computation becomes possible from the vectors.
- Community detection can identify groups of researchers with overlapping expertise.
- Exploratory search is supported through a browser-based interface.
- The knowledge graph can be serialized in RDF format.
- It links to Wikidata for interoperability with external linked open data sources.
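The author-author similarity computation listed above is typically a cosine over the sparse descriptor vectors; a minimal stdlib sketch, with made-up descriptor weights:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term -> weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

a = {"solid electrolyte": 0.9, "cathode coating": 0.4}
b = {"solid electrolyte": 0.7, "anode binder": 0.2}
c = {"perovskite solar cell": 1.0}   # no descriptor overlap with a
```

Authors sharing weighted descriptors score high; authors with disjoint vocabularies score zero, which is what grounds the similarity in domain semantics rather than citation structure.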
Where Pith is reading between the lines
- The semantic grounding of similarities may identify potential collaborators missed by citation-based methods.
- The approach is extensible to other research domains beyond battery science.
- Integration with Wikidata opens possibilities for combining expertise data with institutional or funding information.
Load-bearing premise
The weighted combination of OpenAlex concepts and KeyBERT/ChatGPT-extracted keyphrases adjusted by authorship position and recency produces vectors that meaningfully represent an author's research expertise.
What would settle it
An experiment where battery domain experts rate the accuracy of suggested similar authors or community groupings derived from the vectors.
Original abstract
Battery research is a rapidly growing and highly interdisciplinary field, making it increasingly difficult to track relevant expertise and identify potential collaborators across institutional boundaries. In this work, we present a pipeline for constructing an author-centric knowledge graph of battery research built on OpenAlex, a large-scale open bibliographic catalogue. For each author, we derive a weighted research descriptors vector that combines coarse-grained OpenAlex concepts with fine-grained keyphrases extracted from titles and abstracts using KeyBERT with ChatGPT (gpt-3.5-turbo) as the backend model, selected after evaluating multiple alternatives. Vector components are weighted by research descriptor origin, authorship position, and temporal recency. The framework is applied to a corpus of 189,581 battery-related works. The resulting vectors support author-author similarity computation, community detection, and exploratory search through a browser-based interface. The knowledge graph is then serialized in RDF and linked to Wikidata identifiers, making it interoperable with external linked open data sources and extensible beyond the battery domain. Unlike prior author-centric analyses confined to institutional repositories, our approach operates at cross-institutional scale and grounds similarity in domain semantics rather than citation or co-authorship structure alone.
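A minimal sketch of what the RDF serialization with Wikidata links might look like as N-Triples. Only foaf:name and owl:sameAs are standard vocabulary here; the hasDescriptor property and the base IRIs are hypothetical placeholders, and the paper's actual schema may differ:

```python
def author_to_ntriples(author_iri, name, qid, descriptors):
    """Serialize one author node as N-Triples. foaf:name and owl:sameAs are
    standard vocabulary; hasDescriptor and the example IRIs are placeholders."""
    has_desc = "<https://example.org/schema/hasDescriptor>"  # hypothetical
    lines = [
        f'<{author_iri}> <http://xmlns.com/foaf/0.1/name> "{name}" .',
        f'<{author_iri}> <http://www.w3.org/2002/07/owl#sameAs> '
        f'<http://www.wikidata.org/entity/{qid}> .',
    ]
    for term in descriptors:
        lines.append(f'<{author_iri}> {has_desc} "{term}" .')
    return "\n".join(lines)

nt = author_to_ntriples(
    "https://openalex.org/A123", "Jane Doe", "Q42",
    ["solid electrolyte", "lithium-ion battery"],
)
```

The owl:sameAs link to a Wikidata entity is what makes the graph joinable with external linked open data, e.g. institutional or funding information attached to the same QID.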
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper outlines a pipeline for building an author-centric knowledge graph in battery research using data from the OpenAlex catalog. For 189,581 papers, it creates weighted vectors combining OpenAlex concepts with keyphrases from KeyBERT and gpt-3.5-turbo, adjusted for authorship position and recency. These vectors are used for similarity, community detection, and search, with the graph serialized in RDF and linked to Wikidata identifiers for broader interoperability.
Significance. The construction of a large-scale, open, RDF-linked knowledge graph for battery research expertise represents a practical contribution to the field, particularly given the use of a global open catalog (OpenAlex) and the interoperability with Wikidata. If the vector representations are shown to be effective, this could facilitate cross-institutional collaboration in a rapidly evolving domain. The paper explicitly mentions evaluating multiple keyphrase extraction alternatives, which is a strength.
major comments (1)
- [Abstract] The central claim that the weighted combination of OpenAlex concepts and KeyBERT/ChatGPT-extracted keyphrases produces vectors that meaningfully represent research expertise and support valid author-author similarity, community detection, and exploratory search (Abstract) lacks supporting evidence: there are no ablation studies on the weighting scheme (authorship position and recency), no quantitative metrics for similarity accuracy (e.g., against known co-authorship overlaps), no comparison to citation-based baselines, and no error analysis or validation of the community detection results. This evaluation gap is load-bearing for the paper's assertion of utility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and for recognizing the practical contribution of our work. We address the major comment below.
Point-by-point responses
- Referee: [Abstract] The central claim that the weighted combination of OpenAlex concepts and KeyBERT/ChatGPT-extracted keyphrases produces vectors that meaningfully represent research expertise and support valid author-author similarity, community detection, and exploratory search (Abstract) lacks supporting evidence: there are no ablation studies on the weighting scheme (authorship position and recency), no quantitative metrics for similarity accuracy (e.g., against known co-authorship overlaps), no comparison to citation-based baselines, and no error analysis or validation of the community detection results. This evaluation gap is load-bearing for the paper's assertion of utility.
Authors: We agree that the current manuscript does not include quantitative evaluations such as ablation studies on the weighting scheme, metrics for similarity accuracy against co-authorship data, comparisons to citation-based baselines, or validation of the community detection results. The paper focuses on the construction pipeline, the selection of keyphrase extraction methods after evaluating alternatives, and the resulting knowledge graph with an interface for exploratory use. While we believe the semantic grounding provides value beyond citation structures, we acknowledge that without these validations, the claims of utility for similarity and community detection are not fully substantiated. In the revised version, we will add an evaluation section including: (1) ablation studies varying the weights for authorship position and recency, (2) quantitative similarity evaluation using metrics like precision at k against known co-author overlaps or held-out papers, (3) comparison to simple citation-based author similarity, and (4) validation of detected communities against external knowledge of battery research groups or manual inspection with error analysis. We will update the abstract to reflect the added evaluations.
Revision: yes
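The precision-at-k evaluation the rebuttal proposes (ranking suggested similar authors against held-out co-author overlaps) reduces to a few lines; the author IDs below are invented for illustration:

```python
def precision_at_k(ranked_candidates, relevant, k=10):
    """Fraction of the top-k suggested similar authors that appear in the
    relevant set (e.g., co-authors held out from vector construction)."""
    top = ranked_candidates[:k]
    return sum(1 for a in top if a in relevant) / k

# Invented author IDs for illustration.
ranked = ["a2", "a7", "a5", "a9"]       # candidates, most similar first
coauthors = {"a2", "a5"}                # held-out ground truth
```

Co-author overlap is an imperfect proxy for expertise similarity (it favors the citation/co-authorship structure the paper argues against), so expert ratings would remain a useful complement.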
Circularity Check
No circularity: pipeline uses external OpenAlex data and off-the-shelf NLP tools with explicit weighting rules
full rationale
The manuscript describes an author-vector construction pipeline that ingests bibliographic records from the external OpenAlex catalog, extracts keyphrases via KeyBERT and ChatGPT (chosen after comparing alternatives), and applies deterministic weighting by descriptor origin, authorship position, and recency. The resulting vectors are then used for similarity, community detection, and RDF serialization linked to Wikidata. No equation, parameter fit, or claimed result reduces by construction to a self-referential input; the central assertions rest on the external data source and standard tools rather than any closed loop or self-citation load-bearing step. This is the normal non-circular case for a data-processing pipeline paper.
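Community detection over a thresholded author-similarity graph could, for instance, use simple label propagation; the manuscript does not specify the algorithm used, so this stdlib sketch is illustrative only:

```python
import random

def label_propagation(neighbors, iters=20, seed=0):
    """Asynchronous label propagation on an undirected similarity graph.
    neighbors: node -> list of adjacent nodes (e.g., authors whose vector
    similarity exceeds a threshold). Returns node -> community label."""
    rng = random.Random(seed)
    labels = {n: n for n in neighbors}          # start with singleton labels
    nodes = list(neighbors)
    for _ in range(iters):
        rng.shuffle(nodes)
        for n in nodes:
            if not neighbors[n]:
                continue
            counts = {}
            for m in neighbors[n]:
                counts[labels[m]] = counts.get(labels[m], 0) + 1
            # adopt the majority neighbor label; break ties deterministically
            labels[n] = max(counts, key=lambda lab: (counts[lab], lab))
    return labels

# Two disconnected groups should end up with two distinct labels.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
    "x": ["y"], "y": ["x"],
}
labels = label_propagation(graph)
```

Validating such groupings against known battery research groups, as the rebuttal promises, is exactly the step that would distinguish semantic communities from co-authorship clusters.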
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: OpenAlex provides sufficiently complete coverage of battery-related publications.
- Domain assumption: KeyBERT combined with ChatGPT produces keyphrases that accurately capture fine-grained research descriptors.
Reference graph
Works this paper leans on
- [1] Ma, J., Li, Y., Grundish, N.S., Goodenough, J.B., Chen, Y., Guo, L., Peng, Z., Qi, X., Yang, F., Qie, L., et al.: The 2021 battery technology roadmap. Journal of Physics D: Applied Physics 54(18), 183001 (2021)
- [2] Clark, S., Battaglia, C., Castelli, I.E., Flores, E., Gold, L., Punckt, C., Stier, S., Veit, P.: Semantic resources for managing knowledge in battery research. ChemSusChem 18(16), 202500458
- [3] Zhang, Y., Chen, F., Liu, Z., Ju, Y., Cui, D., Zhu, J., Jiang, X., Guo, X., He, J., Zhang, L., et al.: A materials terminology knowledge graph automatically constructed from text corpus. Scientific Data 11(1), 600 (2024)
- [4] Priem, J., Piwowar, H., Orr, R.: OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833 (2022)
- [5] Grootendorst, M.: KeyBERT: Minimal keyword extraction with BERT. Zenodo (2020). https://doi.org/10.5281/zenodo.4461265
- [6] Manola, F., Miller, E., McBride, B., et al.: RDF Primer. W3C Recommendation 10(1-107), 6 (2004)
- [7] Foppiano, L., Castro, P., Suarez, P., Terashima, K., Takano, Y., Ishii, M.: Automatic extraction of materials and properties from superconductors scientific literature. Science and Technology of Advanced Materials: Methods 3 (2023). https://doi.org/10.1080/27660400.2022.2153633
- [8] Court, C.J., Cole, J.M.: Magnetic and superconducting phase diagrams and transition temperatures predicted using text mining and machine learning. npj Computational Materials 6(1), 18 (2020). https://doi.org/10.1038/s41524-020-0287-8
- [9] Kononova, O., Huo, H., He, T., Rong, Z., Botari, T., Sun, W., Tshitoyan, V., Ceder, G.: Text-mined dataset of inorganic materials synthesis recipes. Scientific Data 6(1), 203 (2019). https://doi.org/10.1038/s41597-019-0224-1
- [10] Huang, S., Cole, J.M.: A database of battery materials auto-generated using ChemDataExtractor. Scientific Data 7(1), 260 (2020)
- [11] Dieb, T.M., Yoshioka, M., Hara, S.: Automatic information extraction of experiments from nanodevices development papers, pp. 42–47. IEEE, Fukuoka, Japan (2012). https://doi.org/10.1109/iiai-aai.2012.18
- [12] Dieb, T.M., Yoshioka, M.: Extraction of chemical and drug named entities by ensemble learning using chemical NER tools based on different extraction guidelines. Trans. Mach. Learn. Data Min. (2015)
- [13] Dieb, T.M., Yoshioka, M., Hara, S.: NaDev: An annotated corpus to support information extraction from research papers on nanocrystal devices. Journal of Information Processing 24(3), 554–564 (2016). https://doi.org/10.2197/ipsjjip.24.554
- [14] Foppiano, L., Dieb, S., Suzuki, A., Baptista De Castro, P., Iwasaki, S., Uzuki, A., Echevarria, M.G.E., Meng, Y., Terashima, K., Romary, L., Takano, Y., Ishii, M.: SuperMat: construction of a linked annotated dataset from superconductors-related publications. Science and Technology of Advanced Materials: Methods 1(1), 34–44 (2021). https://doi.org/10.1080...
- [15] Charnine, M., Tishchenko, L.A.A. and Kochiev: Visualization of research trending topic prediction: Intelligent method for data analysis. Proceedings of the 31st International Conference on Computer Graphics and Vision, Volume 2 (2021). https://doi.org/10.20948/graphicon-2021-3027-1028-1037
- [16] Jamali, H.R., Nikzad, M.: Article title type and its relation with the number of downloads and citations. Scientometrics 88, 653–661 (2011). https://doi.org/10.1007/s11192-011-0412-z
- [17] Katsurai, M., Ono, S.: TrendNets: Mapping emerging research trends from dynamic co-word networks via sparse representation. Scientometrics 121, 1583–1598 (2019). https://doi.org/10.1007/s11192-019-03241-6
- [18] Rani, S., Kumar, M.: Topic modeling and its applications in materials science and engineering. Materials Today: Proceedings 45, 5591–5596 (2021). https://doi.org/10.1016/j.matpr.2021.02.313
- [19] Law, J., Zhuo, H.H., He, J., Rong, E.: LTSG: Latent topical skip-gram for mutually improving topic model and vector representations. Pattern Recognition and Computer Vision, 375–387 (2018). https://doi.org/10.1007/978-3-030-03338-5_32
- [20] Nadim, M., Akopian, D., Matamoros, A.: A comparative assessment of unsupervised keyword extraction tools. IEEE Access 11, 144778–144798 (2023)
- [21] Chataut, S., Do, T., Gurung, B.D.S., Aryal, S., Khanal, A., Lushbough, C., Gnimpieba, E.: Comparative study of domain driven terms extraction using large language models. arXiv preprint arXiv:2404.02330 (2024)
- [22] Jia, X., Roller, C., Wang, C.: LLM-Rank: An unsupervised keyword extraction method using local large language models. In: 2025 IEEE International Conference on Information Reuse and Integration and Data Science (IRI), pp. 73–78. IEEE (2025)
- [23] Maragheh, R.Y., Fang, C., Irugu, C.C., Parikh, P., Cho, J., Xu, J., Sukumar, S., Patel, M., Korpeoglu, E., Kumar, S., et al.: LLM-TAKE: Theme-aware keyword extraction using large language models. In: 2023 IEEE International Conference on Big Data (BigData), pp. 4318–4324. IEEE (2023)
- [24] Xu, J., Shen, S., Li, D., Fu, Y.: A network-embedding based method for author disambiguation. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (2018). https://doi.org/10.1145/3269206.3269272
- [25] Nie, Z., Liu, Y., Yang, L., Li, S., Pan, F.: Construction and application of materials knowledge graph based on author disambiguation: Revisiting the evolution of LiFePO4. Advanced Energy Materials 11(16), 2003580 (2021). https://doi.org/10.1002/aenm.202003580
- [26] Müller, M.: Semantic author name disambiguation with word embeddings. Research and Advanced Technology for Digital Libraries, 300–311 (2017). https://doi.org/10.1007/978-3-319-67008-9_24
- [27] Schäfermeier, B., Hirth, J., Hanika, T.: Research topic flows in co-authorship networks. Scientometrics 128(9), 5051–5078 (2023)
- [28] Ghosal, T., Tiwary, P., Patton, R.M., Stahl, C.C.: Towards establishing a research lineage via identification of significant citations. Quantitative Science Studies 2, 1511–1528 (2021). https://doi.org/10.1162/qss_a_00170
- [29] Dieb, S., Amano, K., Tanabe, K., Sato, D., Ishii, M., Tanifuji, M.: Creating research topic map for NIMS SAMURAI database using natural language processing approach. Science and Technology of Advanced Materials: Methods 1(1), 2–11 (2021). https://doi.org/10.1080/27660400.2021.1899426
- [30] BATTERY 2030+ Consortium: BATTERY 2030+: Large-Scale Research Initiative (2025). https://battery2030.eu. Accessed: 2026-04-06
- [31] Färber, M., Lamprecht, D., Krause, J., Aung, L., Haase, P.: SemOpenAlex: The scientific landscape in 26 billion RDF triples. In: Payne, T.R., Presutti, V., Qi, G., Poveda-Villalón, M., Stoilos, G., Hollink, L., Kaoudi, Z., Cheng, G., Li, J. (eds.) The Semantic Web – ISWC 2023, pp. 94–112. Springer, Cham (2023)
- [32] Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.-J.P., Wang, K.: An overview of Microsoft Academic Service (MAS) and applications. In: Proceedings of the 24th International Conference on World Wide Web, pp. 243–246. ACM, New York, NY, USA (2015). https://doi.org/10.1145/2740908.2742839
- [33] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2019). https://arxiv.org/abs/1810.04805
- [34] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2019). https://arxiv.org/abs/1908.10084
- [35] Huang, S., Cole, J.M.: BatteryBERT: A pretrained language model for battery database enhancement. Journal of Chemical Information and Modeling 62(24), 6365–6377 (2022)
- [36] GROBID. GitHub. https://github.com/grobidOrg/grobid
- [37] Mueller, A.C.: Wordcloud (2023). https://github.com/amueller/wordcloud
- [38] Liu, J., Kaneko, T., Ock, J.-Y., Kondou, S., Ueno, K., Dokko, K., Sodeyama, K., Watanabe, M.: Distinct differences in Li-deposition/dissolution reversibility in sulfolane-based electrolytes depending on Li-salt species and their solvation structures. The Journal of Physical Chemistry C 127(12), 5689–5701 (2023). https://doi.org/10.1021/acs.jpcc.2c09040
- [39] Tomoaki, K., Yui, F., Toshihiko, M., Hiroaki, K., Keitaro, S.: Ether molecule decomposition on MgM2O4 (M = Mn, Fe, Co) spinel surface: A first-principles study. Electrochemistry 92(2) (2024). https://doi.org/10.5796/electrochemistry.23-00087
- [40] Sodeyama, K., Yamada, Y., Aikawa, K., Yamada, A., Tateyama, Y.: Sacrificial anion reduction mechanism for electrochemical stability improvement in highly concentrated Li-salt electrolyte. The Journal of Physical Chemistry C 118(26), 14091–14097 (2014). https://doi.org/10.1021/jp501178n
- [41] Sodeyama, K., Igarashi, Y., Nakayama, T., Tateyama, Y., Okada, M.: Liquid electrolyte informatics using an exhaustive search with linear regression. Physical Chemistry Chemical Physics 20(35), 22585–22591 (2018). https://doi.org/10.1039/c7cp08280k
- [42] W3C: Resource Description Framework (RDF). https://www.w3.org/RDF/. Accessed: 2026-01-22
- [43] Ellefi, M.B., Bellahsene, Z., Breslin, J.G., Demidova, E., Dietze, S., Szymański, J., Todorov, K.: RDF dataset profiling – a survey of features, methods, vocabularies and applications. Semantic Web 9(5), 677–705 (2018). https://doi.org/10.3233/SW-180294
- [44] Library Carpentry: What is Wikidata? Introduction to Wikidata for Librarians. https://librarycarpentry.github.io/lc-wikidata/01-introduction.html. Accessed: 2026-01-22
- [45] Van Veen, T.: Wikidata: from "an" identifier to "the" identifier. Information Technology and Libraries 38(2), 72–81 (2019)
- [46] Burgstaller-Muehlbacher, S., Waagmeester, A., Mitraka, E., Turner, J., Putman, T., Leong, J., Naik, C., Pavlidis, P., Schriml, L., Good, B.M., Su, A.I.: Wikidata as a semantic framework for the Gene Wiki initiative. Database 2016, baw015 (2016). https://doi.org/10.1093/database/baw015