KadiAssistant: A conversational AI Agent for information retrieval in Kadi4Mat

Adrian Cierpka; Arnd Koeppe; Eric Dietriche Sesso Domtchoueng; Johannes Steinh\"ulb; Michael Selzer; Mohammad Shafiqul Islam

arxiv: 2605.18850 · v1 · pith:276PHN3Jnew · submitted 2026-05-13 · 💻 cs.IR · cs.AI

KadiAssistant: A conversational AI Agent for information retrieval in Kadi4Mat

Adrian Cierpka , Mohammad Shafiqul Islam , Johannes Steinh\"ulb , Eric Dietriche Sesso Domtchoueng , Michael Selzer , Arnd Koeppe This is my paper

Pith reviewed 2026-05-20 21:33 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords KadiAssistantconversational AIinformation retrievalprivacy-preserving searchresearch data managementKadi4Matsemantic searchretrieval-augmented generation

0 comments

The pith

KadiAssistant pairs a self-hosted LLM with privacy-preserving semantic search to let researchers query and synthesize access-controlled data in Kadi4Mat.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

KadiAssistant is a conversational AI agent built into the Kadi research data platform to help scientists retrieve, combine, and summarize information from scattered, heterogeneous datasets. In fields such as materials science, data arrive in incompatible formats and vocabularies while some records must remain private. The system meets these constraints by running a local large language model together with a semantic search layer that respects each record's fine-grained permissions. If the approach works, researchers can obtain structured answers without mastering every metadata schema or risking unauthorized disclosure. The design also advances the findability goal of FAIR data principles by lowering barriers to relevant private and public records alike.

Core claim

KadiAssistant combines a self-hosted large language model with a privacy-preserving semantic search, inspired by retrieval-augmented generation, that can access files and record metadata on Kadi. This allows the assistant to screen, aggregate, and structure information into a highly informative answer while respecting fine-grained access permissions and bridging terminology differences across disciplines.

What carries the argument

Self-hosted LLM integrated with privacy-preserving semantic search that reads Kadi files and metadata under enforced access rules.

If this is right

Researchers gain the ability to retrieve information from distributed records without prior mastery of metadata schemas or domain-specific interfaces.
Fine-grained permissions on sensitive warm data remain intact during search and synthesis operations.
Terminology and format differences across disciplines are bridged so that answers can draw from electrochemical, simulation, and manufacturing records alike.
The findable pillar of FAIR principles is strengthened for both private and published data in the Kadi ecosystem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same privacy-first retrieval pattern could be applied to other research data platforms that store mixed-sensitivity records.
Widespread adoption might encourage metadata standardization because accurate aggregation depends on consistent terminology.
Self-hosting the language model keeps raw data inside institutional boundaries, offering a template for privacy-sensitive scientific assistants elsewhere.

Load-bearing premise

The semantic search can reliably enforce fine-grained access permissions across heterogeneous data formats and terminologies while still producing accurate aggregated answers.

What would settle it

A user query that returns content from a record the user is not permitted to see, or that misses relevant records because of mismatched terminology in the permission layer.

Figures

Figures reproduced from arXiv: 2605.18850 by Adrian Cierpka, Arnd Koeppe, Eric Dietriche Sesso Domtchoueng, Johannes Steinh\"ulb, Michael Selzer, Mohammad Shafiqul Islam.

**Figure 2.** Figure 2: The Overall architecture of the system, consisting of 6 components. The KadiWeb UI runs on the user’s [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Flow diagram for the semantic similarity search. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 5.** Figure 5: The state graph of our AI Agent. The entry point is at [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: KadiAssistants user interface integrated into KadiWeb. The AI Assistant’s response includes the answer, [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: An example use case for the LISA-Replica dataset, showing the retrieval of a relevant document. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: An example use case for the POLiS Ontology dataset necessitating multiple tool calls. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: An example task for KadiAssistant on the ML-Breathing-Detection dataset. System messages are not shown [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 10.** Figure 10: Average execution time (100 trials per datapoint) of the k-NN search (k=50) with increasing number of [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

**Figure 11.** Figure 11: Average execution time (500 trials for each datapoint) of the k-NN search (with k=50) for a growing number [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗

**Figure 12.** Figure 12: Memory growth of the HNSW index and the total vector table, including HNSW and TOAST data for 4 [PITH_FULL_IMAGE:figures/full_fig_p015_12.png] view at source ↗

read the original abstract

We introduce KadiAssistant, a privacy-by-design AI assistant integrated into the Kadi research data ecosystem, enabling researchers to efficiently access, aggregate, and synthesize information from heterogeneous, privacy-sensitive research data. Interdisciplinary fields such as materials science bring together disciplines with their own terminology and standards. While this convergence fuels innovation, it also makes it increasingly difficult to connect and access knowledge, as data are distributed across disciplines, organizations, and individuals. For example, battery research combines electrochemical measurements, materials characterization data, physics-based simulations, and manufacturing parameters, each using different formats, vocabularies, and standards. Efficiently storing and sharing such heterogeneous data via research data platforms, such as Kadi4Mat, demands domain knowledge, technical expertise, and familiarity with metadata schemas and interfaces. Research data also vary in sensitivity: newly generated 'warm' data are often private, whereas published 'cold' data are usually openly accessible. The Kadi ecosystem offers fine-grained access control needed for sensitive data. A solution for efficient information retrieval in Kadi must therefore respect the fine-grained access permissions. To address these intertwined challenges of information retrieval, strong data privacy, and complex access control, KadiAssistant combines a self-hosted large language model (LLM) with a privacy-preserving semantic search, inspired by retrieval-augmented generation, that can access files and record metadata on Kadi. This allows the assistant to screen, aggregate, and structure information into a highly informative answer. KadiAssistant therefore bridges terminology and standards, lowers access barriers for researchers, and strengthens the Findable pillar of FAIR data principles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces KadiAssistant, a privacy-by-design conversational AI agent for information retrieval in the Kadi4Mat research data management platform. It describes an architecture combining a self-hosted LLM with a privacy-preserving semantic search (RAG-inspired) that accesses files and record metadata to screen, aggregate, and structure heterogeneous research data (e.g., in battery research) while respecting fine-grained access permissions and supporting the Findable aspect of FAIR principles.

Significance. If the described components function as claimed, the system could meaningfully reduce barriers to synthesizing interdisciplinary research data under strict privacy constraints, advancing practical use of research data platforms. The self-hosted LLM choice and emphasis on permission-respecting retrieval are positive design choices for sensitive domains. However, significance remains prospective given the absence of any implementation, metrics, or validation.

major comments (2)

[Architecture description] The description of the privacy-preserving semantic search (in the paragraph beginning 'To address these intertwined challenges...'): the mechanism for integrating fine-grained access permissions into retrieval is unspecified (e.g., no mention of query-time filtering, per-user embeddings, or post-retrieval gating). This is load-bearing for the central claim that the assistant retrieves only permitted data while still producing accurate aggregated answers across heterogeneous formats and terminologies.
[Overall manuscript] No evaluation or implementation section: the manuscript provides no metrics, error analysis, or tests on mixed-permission heterogeneous datasets, leaving the claims about reliable screening/aggregation and privacy enforcement unverified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We address each major comment below, indicating the revisions we will incorporate to clarify the architecture and strengthen the presentation of the work.

read point-by-point responses

Referee: [Architecture description] The description of the privacy-preserving semantic search (in the paragraph beginning 'To address these intertwined challenges...'): the mechanism for integrating fine-grained access permissions into retrieval is unspecified (e.g., no mention of query-time filtering, per-user embeddings, or post-retrieval gating). This is load-bearing for the central claim that the assistant retrieves only permitted data while still producing accurate aggregated answers across heterogeneous formats and terminologies.

Authors: We agree that the current description leaves the permission integration mechanism underspecified. In the revised manuscript we will expand the relevant paragraph to state that Kadi4Mat's native fine-grained access control is applied at query time: the semantic search component issues retrieval requests under the authenticated user's session, so that only records and files the user is permitted to access are returned before any embedding or LLM processing occurs. This query-time filtering respects existing permissions without requiring per-user embeddings or post-retrieval gating, thereby supporting both privacy and the production of accurate aggregated answers from heterogeneous data. revision: yes
Referee: [Overall manuscript] No evaluation or implementation section: the manuscript provides no metrics, error analysis, or tests on mixed-permission heterogeneous datasets, leaving the claims about reliable screening/aggregation and privacy enforcement unverified.

Authors: The manuscript presents the conceptual design and privacy-preserving architecture of KadiAssistant. We acknowledge that empirical validation would strengthen the claims. We will add a dedicated section describing the current prototype implementation status and planned evaluation, including qualitative tests on heterogeneous battery-research datasets that respect mixed permissions and metrics such as retrieval relevance and permission adherence. Comprehensive quantitative error analysis and large-scale validation remain part of planned follow-up work. revision: partial

Circularity Check

0 steps flagged

No circularity: system description with no derivations or fitted results

full rationale

The manuscript is a high-level system description of KadiAssistant that introduces an architecture combining a self-hosted LLM with privacy-preserving semantic search inspired by RAG. No equations, derivations, parameter fittings, or mathematical claims are present in the abstract or described full text. The central design choice is presented as an engineering solution to access-control and terminology challenges rather than a result derived from prior steps within the paper. No self-citations, uniqueness theorems, or ansatzes are invoked to justify any load-bearing claim. This is a standard non-circular descriptive paper whose content is self-contained as an architectural proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

This is a system-description paper rather than a theoretical or empirical derivation; no free parameters, mathematical axioms, or independently evidenced invented entities are introduced beyond the named software artifact itself.

invented entities (1)

KadiAssistant no independent evidence
purpose: Conversational interface for privacy-respecting retrieval in Kadi4Mat
The assistant is the primary contribution but no external benchmarks or independent validation are described in the abstract.

pith-pipeline@v0.9.0 · 5849 in / 1208 out tokens · 90593 ms · 2026-05-20T21:33:17.813860+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

KadiAssistant combines a self-hosted large language model (LLM) with a privacy-preserving semantic search, inspired by retrieval-augmented generation, that can access files and record metadata on Kadi.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

To add a user access rights filter to the k-NN search, we use the 'record_id' and an existing Kadi4Mat function that retrieves all accessible record IDs.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 9 internal anchors

[2]

Harrison Chase and LangChain Contributors.LangChain.URL: https://www.langchain.com/ (visited on 03/25/2026)

work page 2026
[3]

Harrison Chase and LangChain Contributors.LangGraph.URL: https://github.com/langchain- ai/ langgraph(visited on 03/25/2026)

work page 2026
[4]

A survey on privacy risks and protection in large language models

Kang Chen et al. “A survey on privacy risks and protection in large language models”. In:Journal of King Saud University Computer and Information Sciences37.7 (Sept. 2025), p. 163.ISSN: 1319-1578, 2213-1248.DOI: 10.1007/s44443-025-00177-1.URL:https://link.springer.com/10.1007/s44443-025-00177-1

work page doi:10.1007/s44443-025-00177-1.url:https://link.springer.com/10.1007/s44443-025-00177-1 2025
[5]

Open-source embedding database for AI and LLM applications

Chroma.ChromaDB. Open-source embedding database for AI and LLM applications. Chroma.URL: https: //www.trychroma.com/(visited on 05/06/2026)

work page 2026
[6]

May 2026.DOI: 10.5281/zenodo.20072958 .URL: https: //doi.org/10.5281/zenodo.20072958

Adrian Cierpka et al.KadiAssistant Dataset. May 2026.DOI: 10.5281/zenodo.20072958 .URL: https: //doi.org/10.5281/zenodo.20072958

work page doi:10.5281/zenodo.20072958 2026
[7]

Semantic Resources for Managing Knowledge in Battery Research

Simon Clark et al. “Semantic Resources for Managing Knowledge in Battery Research”. In:ChemSusChem 18.16 (Aug. 2025), e202500458.ISSN: 1864-5631, 1864-564X.DOI: 10 . 1002 / cssc . 202500458.URL: https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cssc.202500458

work page doi:10.1002/cssc.202500458 2025
[8]

com/pgvector/pgvector(visited on 05/06/2026)

pgvector contributors.pgvector: Open-source vector similarity search for PostgreSQL.URL: https://github. com/pgvector/pgvector(visited on 05/06/2026)

work page 2026
[9]

Matthijs Douze et al.The Faiss library. 2025. arXiv: 2401.08281 [cs.LG].URL: https://arxiv.org/abs/ 2401.08281

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

European Organization For Nuclear Research and OpenAIRE.Zenodo. en. 2013.DOI: 10.25495/7GXK-RD71. URL:https://www.zenodo.org/

work page doi:10.25495/7gxk-rd71 2013
[11]

Aaron Grattafiori et al.The Llama 3 Herd of Models. Nov. 2024.DOI: 10.48550/arXiv.2407.21783. arXiv: 2407.21783 [cs].URL:http://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024
[12]

Julian Grolig et al.FAIR Development of Data-integrated AI to Detect Breathing Motion in Dynamic Lung MRI. Sept. 2023.DOI:10.5281/zenodo.8366576.URL:https://doi.org/10.5281/zenodo.8366576

work page doi:10.5281/zenodo.8366576.url:https://doi.org/10.5281/zenodo.8366576 2023
[13]

Accelerated data-driven materials science with the Materials Project

Matthew K. Horton et al. “Accelerated data-driven materials science with the Materials Project”. In:Nature Materials24.10 (Oct. 2025), pp. 1522–1532.ISSN: 1476-1122, 1476-4660.DOI: 10.1038/s41563- 025- 02272-0.URL:https://www.nature.com/articles/s41563-025-02272-0

work page doi:10.1038/s41563- 2025
[14]

P., Hautier, G., Chen, W., Richards, W

Anubhav Jain et al. “Commentary: The Materials Project: A materials genome approach to accelerating materials innovation”. In:APL Materials1.1 (July 2013), p. 011002.ISSN: 2166-532X.DOI: 10.1063/1.4812323. eprint: https://pubs.aip.org/aip/apm/article- pdf/doi/10.1063/1.4812323/13163869/011002_1_ online.pdf.URL:https://doi.org/10.1063/1.4812323. 16 APREPRI...

work page doi:10.1063/1.4812323 2013
[15]

User Privacy and Large Language Models: An Analysis of Frontier Developers’ Privacy Policies

Jennifer King et al. “User Privacy and Large Language Models: An Analysis of Frontier Developers’ Privacy Policies”. In:Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society8.2 (Oct. 2025), pp. 1465– 1477.ISSN: 3065-8365.DOI: 10.1609/aies.v8i2.36646 .URL: https://ojs.aaai.org/index.php/ AIES/article/view/36646

work page doi:10.1609/aies.v8i2.36646 2025
[16]

Arnd Koeppe and CIDS Contributors.CIDS.URL: https://gitlab.com/intelligent-analysis/cids (visited on 03/26/2026)

work page 2026
[17]

Woosuk Kwon et al.Efficient Memory Management for Large Language Model Serving with PagedAttention. Sept. 2023.DOI: 10.48550/arXiv.2309.06180. arXiv: 2309.06180 [cs].URL: http://arxiv.org/abs/ 2309.06180

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.06180 2023
[18]

Patrick Lewis et al.Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Apr. 12, 2021.DOI: 10.48550/arXiv.2005.11401 . arXiv: 2005.11401 [cs] .URL: http://arxiv.org/abs/2005.11401 . Pre-published

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.11401 2021
[19]

Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs

Yu A. Malkov and D. A. Yashunin.Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. Aug. 2018.DOI: 10.48550/arXiv.1603.09320. arXiv: 1603.09320 [cs]. URL:http://arxiv.org/abs/1603.09320

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1603.09320 2018
[20]

https://huggingface.co/meta- llama/Meta- Llama- 3- 70B- Instruct

Meta AI.Meta-Llama-3-70B-Instruct. https://huggingface.co/meta- llama/Meta- Llama- 3- 70B- Instruct. 2024. (Visited on 03/24/2026)

work page 2024
[21]

Enhancing Multiscale Simulation Data Management with Domain Ontologies and an ELN: Addressing Challenges and Implementing Strategies

Hafiz Muhammad Noman and Michael Selzer. “Enhancing Multiscale Simulation Data Management with Domain Ontologies and an ELN: Addressing Challenges and Implementing Strategies”. In:Data Science Journal 24 (Sept. 2025), p. 28.ISSN: 1683-1470.DOI: 10.5334/dsj- 2025- 028 .URL: https://datascience. codata.org/articles/10.5334/dsj-2025-028/

work page doi:10.5334/dsj- 2025
[22]

Large language model chatbot developed by OpenAI

OpenAI.ChatGPT. Large language model chatbot developed by OpenAI. First released November 2022. OpenAI, 2022.URL:https://chat.openai.com/(visited on 05/06/2026)

work page 2022
[23]

OpenAI et al.gpt-oss-120b & gpt-oss-20b Model Card. 2025. arXiv: 2508.10925 [cs.CL] .URL: https: //arxiv.org/abs/2508.10925

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

AI-powered search and chat platform developed by Perplexity AI

Perplexity AI.Perplexity.ai. AI-powered search and chat platform developed by Perplexity AI. Perplexity AI. URL:https://www.perplexity.ai/(visited on 05/06/2026)

work page 2026
[25]

Open-source vector similarity search engine

Qdrant.Qdrant Vector Database. Open-source vector similarity search engine. Qdrant.URL: https://qdrant. tech/(visited on 05/06/2026)

work page 2026
[26]

Data-Driven Virtual Material Analysis and Synthesis for Solid Electrolyte Inter- phases

Deepalaxmi Rajagopal et al. “Data-Driven Virtual Material Analysis and Synthesis for Solid Electrolyte Inter- phases”. In:Advanced Energy Materials13.40 (Oct. 2023), p. 2301985.ISSN: 1614-6832, 1614-6840.DOI: 10.1002/aenm.202301985.URL: https://advanced.onlinelibrary.wiley.com/doi/10.1002/aenm. 202301985

work page doi:10.1002/aenm.202301985.url: 2023
[27]

The NOMAD Artificial-Intelligence Toolkit: turning materials-science data into knowledge and understanding

Luigi Sbailò et al. “The NOMAD Artificial-Intelligence Toolkit: turning materials-science data into knowledge and understanding”. In:npj Computational Materials8.1 (Dec. 5, 2022), p. 250.ISSN: 2057-3960.DOI: 10. 1038/s41524-022-00935-z.URL:https://www.nature.com/articles/s41524-022-00935-z

work page 2022
[28]

NOMAD: A distributed web-based platform for managingmaterials science research data

Markus Scheidgen et al. “NOMAD: A distributed web-based platform for managingmaterials science research data”. In:Journal of Open Source Software8.90 (Oct. 15, 2023), p. 5388.ISSN: 2475-9066.DOI: 10.21105/ joss.05388.URL:https://joss.theoj.org/papers/10.21105/joss.05388

work page doi:10.21105/joss.05388 2023
[29]

2024.DOI:10.25446/OXFORD.25231847.V2.URL: https://portal.sds

Damon Strange and Megan Gooch.Taking the temperature - exploring the lifecycle of research data using the ‘hot’, ‘warm’ or ‘cold’ metaphor. 2024.DOI:10.25446/OXFORD.25231847.V2.URL: https://portal.sds. ox.ac.uk/articles/conference_contribution/Taking_the_temperature_- _exploring_the_ lifecycle_of_research_data_using_the_hot_warm_or_cold_metaphor/25231847/2

work page doi:10.25446/oxford.25231847.v2.url: 2024
[30]

Version 0.47.0

Kadi4Mat Team and Contributors.kadi-apy. Version 0.47.0. May 2025.DOI: 10.5281/zenodo.15422101 . URL:https://doi.org/10.5281/zenodo.15422101

work page doi:10.5281/zenodo.15422101 2025
[31]

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Nandan Thakur et al.BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. 2021.DOI:10.48550/ARXIV.2104.08663.URL:https://arxiv.org/abs/2104.08663

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2104.08663.url:https://arxiv.org/abs/2104.08663 2021
[32]

2025.DOI: 10.71707/WDQR- 2J50.URL:https://commons.datacite.org/doi.org/10.71707/wdqr-2j50(visited on 02/05/2026)

The Navigation Fund.AIRDEC - Al-assisted Repository DEposit and Curation. 2025.DOI: 10.71707/WDQR- 2J50.URL:https://commons.datacite.org/doi.org/10.71707/wdqr-2j50(visited on 02/05/2026)

work page doi:10.71707/wdqr- 2025
[33]

com / Unstructured - IO / unstructured(visited on 03/25/2026)

Unstructured.io.Unstructured open source library.URL: https : / / github . com / Unstructured - IO / unstructured(visited on 03/25/2026)

work page 2026
[34]

Ashish Vaswani et al.Attention Is All You Need. Aug. 2023.DOI: 10.48550/arXiv.1706.03762 . arXiv: 1706.03762 [cs].URL:http://arxiv.org/abs/1706.03762

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762 2023
[35]

Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E

Mark D. Wilkinson et al. “The FAIR Guiding Principles for scientific data management and stewardship”. In:Scientific Data3.1 (Mar. 15, 2016), p. 160018.ISSN: 2052-4463.DOI: 10.1038/sdata.2016.18 .URL: https://www.nature.com/articles/sdata201618. 17 APREPRINT- MAY20, 2026

work page doi:10.1038/sdata.2016.18 2016
[36]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang et al.Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. June 2025.DOI: 10.48550/arXiv.2506.05176 . arXiv: 2506.05176 [cs] .URL: http://arxiv. org/abs/2506.05176

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.05176 2025
[37]

LISA: A Lithium-Ion Solid-State Assistant using large language models for knowledge defragmentation in battery science and beyond

Yinghan Zhao et al. “LISA: A Lithium-Ion Solid-State Assistant using large language models for knowledge defragmentation in battery science and beyond”. In:Materials Today Communications45 (Apr. 2025), p. 112380. ISSN: 2352-4928.DOI: 10.1016/j.mtcomm.2025.112380 .URL: https://linkinghub.elsevier.com/ retrieve/pii/S235249282500892X. Acknowledgments This wo...

work page doi:10.1016/j.mtcomm.2025.112380 2025

[1] [2]

Harrison Chase and LangChain Contributors.LangChain.URL: https://www.langchain.com/ (visited on 03/25/2026)

work page 2026

[2] [3]

Harrison Chase and LangChain Contributors.LangGraph.URL: https://github.com/langchain- ai/ langgraph(visited on 03/25/2026)

work page 2026

[3] [4]

A survey on privacy risks and protection in large language models

Kang Chen et al. “A survey on privacy risks and protection in large language models”. In:Journal of King Saud University Computer and Information Sciences37.7 (Sept. 2025), p. 163.ISSN: 1319-1578, 2213-1248.DOI: 10.1007/s44443-025-00177-1.URL:https://link.springer.com/10.1007/s44443-025-00177-1

work page doi:10.1007/s44443-025-00177-1.url:https://link.springer.com/10.1007/s44443-025-00177-1 2025

[4] [5]

Open-source embedding database for AI and LLM applications

Chroma.ChromaDB. Open-source embedding database for AI and LLM applications. Chroma.URL: https: //www.trychroma.com/(visited on 05/06/2026)

work page 2026

[5] [6]

May 2026.DOI: 10.5281/zenodo.20072958 .URL: https: //doi.org/10.5281/zenodo.20072958

Adrian Cierpka et al.KadiAssistant Dataset. May 2026.DOI: 10.5281/zenodo.20072958 .URL: https: //doi.org/10.5281/zenodo.20072958

work page doi:10.5281/zenodo.20072958 2026

[6] [7]

Semantic Resources for Managing Knowledge in Battery Research

Simon Clark et al. “Semantic Resources for Managing Knowledge in Battery Research”. In:ChemSusChem 18.16 (Aug. 2025), e202500458.ISSN: 1864-5631, 1864-564X.DOI: 10 . 1002 / cssc . 202500458.URL: https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cssc.202500458

work page doi:10.1002/cssc.202500458 2025

[7] [8]

com/pgvector/pgvector(visited on 05/06/2026)

pgvector contributors.pgvector: Open-source vector similarity search for PostgreSQL.URL: https://github. com/pgvector/pgvector(visited on 05/06/2026)

work page 2026

[8] [9]

Matthijs Douze et al.The Faiss library. 2025. arXiv: 2401.08281 [cs.LG].URL: https://arxiv.org/abs/ 2401.08281

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [10]

European Organization For Nuclear Research and OpenAIRE.Zenodo. en. 2013.DOI: 10.25495/7GXK-RD71. URL:https://www.zenodo.org/

work page doi:10.25495/7gxk-rd71 2013

[10] [11]

Aaron Grattafiori et al.The Llama 3 Herd of Models. Nov. 2024.DOI: 10.48550/arXiv.2407.21783. arXiv: 2407.21783 [cs].URL:http://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024

[11] [12]

Julian Grolig et al.FAIR Development of Data-integrated AI to Detect Breathing Motion in Dynamic Lung MRI. Sept. 2023.DOI:10.5281/zenodo.8366576.URL:https://doi.org/10.5281/zenodo.8366576

work page doi:10.5281/zenodo.8366576.url:https://doi.org/10.5281/zenodo.8366576 2023

[12] [13]

Accelerated data-driven materials science with the Materials Project

Matthew K. Horton et al. “Accelerated data-driven materials science with the Materials Project”. In:Nature Materials24.10 (Oct. 2025), pp. 1522–1532.ISSN: 1476-1122, 1476-4660.DOI: 10.1038/s41563- 025- 02272-0.URL:https://www.nature.com/articles/s41563-025-02272-0

work page doi:10.1038/s41563- 2025

[13] [14]

P., Hautier, G., Chen, W., Richards, W

Anubhav Jain et al. “Commentary: The Materials Project: A materials genome approach to accelerating materials innovation”. In:APL Materials1.1 (July 2013), p. 011002.ISSN: 2166-532X.DOI: 10.1063/1.4812323. eprint: https://pubs.aip.org/aip/apm/article- pdf/doi/10.1063/1.4812323/13163869/011002_1_ online.pdf.URL:https://doi.org/10.1063/1.4812323. 16 APREPRI...

work page doi:10.1063/1.4812323 2013

[14] [15]

User Privacy and Large Language Models: An Analysis of Frontier Developers’ Privacy Policies

Jennifer King et al. “User Privacy and Large Language Models: An Analysis of Frontier Developers’ Privacy Policies”. In:Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society8.2 (Oct. 2025), pp. 1465– 1477.ISSN: 3065-8365.DOI: 10.1609/aies.v8i2.36646 .URL: https://ojs.aaai.org/index.php/ AIES/article/view/36646

work page doi:10.1609/aies.v8i2.36646 2025

[15] [16]

Arnd Koeppe and CIDS Contributors.CIDS.URL: https://gitlab.com/intelligent-analysis/cids (visited on 03/26/2026)

work page 2026

[16] [17]

Woosuk Kwon et al.Efficient Memory Management for Large Language Model Serving with PagedAttention. Sept. 2023.DOI: 10.48550/arXiv.2309.06180. arXiv: 2309.06180 [cs].URL: http://arxiv.org/abs/ 2309.06180

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.06180 2023

[17] [18]

Patrick Lewis et al.Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Apr. 12, 2021.DOI: 10.48550/arXiv.2005.11401 . arXiv: 2005.11401 [cs] .URL: http://arxiv.org/abs/2005.11401 . Pre-published

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.11401 2021

[18] [19]

Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs

Yu A. Malkov and D. A. Yashunin.Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. Aug. 2018.DOI: 10.48550/arXiv.1603.09320. arXiv: 1603.09320 [cs]. URL:http://arxiv.org/abs/1603.09320

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1603.09320 2018

[19] [20]

https://huggingface.co/meta- llama/Meta- Llama- 3- 70B- Instruct

Meta AI.Meta-Llama-3-70B-Instruct. https://huggingface.co/meta- llama/Meta- Llama- 3- 70B- Instruct. 2024. (Visited on 03/24/2026)

work page 2024

[20] [21]

Enhancing Multiscale Simulation Data Management with Domain Ontologies and an ELN: Addressing Challenges and Implementing Strategies

Hafiz Muhammad Noman and Michael Selzer. “Enhancing Multiscale Simulation Data Management with Domain Ontologies and an ELN: Addressing Challenges and Implementing Strategies”. In:Data Science Journal 24 (Sept. 2025), p. 28.ISSN: 1683-1470.DOI: 10.5334/dsj- 2025- 028 .URL: https://datascience. codata.org/articles/10.5334/dsj-2025-028/

work page doi:10.5334/dsj- 2025

[21] [22]

Large language model chatbot developed by OpenAI

OpenAI.ChatGPT. Large language model chatbot developed by OpenAI. First released November 2022. OpenAI, 2022.URL:https://chat.openai.com/(visited on 05/06/2026)

work page 2022

[22] [23]

OpenAI et al.gpt-oss-120b & gpt-oss-20b Model Card. 2025. arXiv: 2508.10925 [cs.CL] .URL: https: //arxiv.org/abs/2508.10925

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [24]

AI-powered search and chat platform developed by Perplexity AI

Perplexity AI.Perplexity.ai. AI-powered search and chat platform developed by Perplexity AI. Perplexity AI. URL:https://www.perplexity.ai/(visited on 05/06/2026)

work page 2026

[24] [25]

Open-source vector similarity search engine

Qdrant.Qdrant Vector Database. Open-source vector similarity search engine. Qdrant.URL: https://qdrant. tech/(visited on 05/06/2026)

work page 2026

[25] [26]

Data-Driven Virtual Material Analysis and Synthesis for Solid Electrolyte Inter- phases

Deepalaxmi Rajagopal et al. “Data-Driven Virtual Material Analysis and Synthesis for Solid Electrolyte Inter- phases”. In:Advanced Energy Materials13.40 (Oct. 2023), p. 2301985.ISSN: 1614-6832, 1614-6840.DOI: 10.1002/aenm.202301985.URL: https://advanced.onlinelibrary.wiley.com/doi/10.1002/aenm. 202301985

work page doi:10.1002/aenm.202301985.url: 2023

[26] [27]

The NOMAD Artificial-Intelligence Toolkit: turning materials-science data into knowledge and understanding

Luigi Sbailò et al. “The NOMAD Artificial-Intelligence Toolkit: turning materials-science data into knowledge and understanding”. In:npj Computational Materials8.1 (Dec. 5, 2022), p. 250.ISSN: 2057-3960.DOI: 10. 1038/s41524-022-00935-z.URL:https://www.nature.com/articles/s41524-022-00935-z

work page 2022

[27] [28]

NOMAD: A distributed web-based platform for managingmaterials science research data

Markus Scheidgen et al. “NOMAD: A distributed web-based platform for managingmaterials science research data”. In:Journal of Open Source Software8.90 (Oct. 15, 2023), p. 5388.ISSN: 2475-9066.DOI: 10.21105/ joss.05388.URL:https://joss.theoj.org/papers/10.21105/joss.05388

work page doi:10.21105/joss.05388 2023

[28] [29]

2024.DOI:10.25446/OXFORD.25231847.V2.URL: https://portal.sds

Damon Strange and Megan Gooch.Taking the temperature - exploring the lifecycle of research data using the ‘hot’, ‘warm’ or ‘cold’ metaphor. 2024.DOI:10.25446/OXFORD.25231847.V2.URL: https://portal.sds. ox.ac.uk/articles/conference_contribution/Taking_the_temperature_- _exploring_the_ lifecycle_of_research_data_using_the_hot_warm_or_cold_metaphor/25231847/2

work page doi:10.25446/oxford.25231847.v2.url: 2024

[29] [30]

Version 0.47.0

Kadi4Mat Team and Contributors.kadi-apy. Version 0.47.0. May 2025.DOI: 10.5281/zenodo.15422101 . URL:https://doi.org/10.5281/zenodo.15422101

work page doi:10.5281/zenodo.15422101 2025

[30] [31]

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Nandan Thakur et al.BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. 2021.DOI:10.48550/ARXIV.2104.08663.URL:https://arxiv.org/abs/2104.08663

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2104.08663.url:https://arxiv.org/abs/2104.08663 2021

[31] [32]

2025.DOI: 10.71707/WDQR- 2J50.URL:https://commons.datacite.org/doi.org/10.71707/wdqr-2j50(visited on 02/05/2026)

The Navigation Fund.AIRDEC - Al-assisted Repository DEposit and Curation. 2025.DOI: 10.71707/WDQR- 2J50.URL:https://commons.datacite.org/doi.org/10.71707/wdqr-2j50(visited on 02/05/2026)

work page doi:10.71707/wdqr- 2025

[32] [33]

com / Unstructured - IO / unstructured(visited on 03/25/2026)

Unstructured.io.Unstructured open source library.URL: https : / / github . com / Unstructured - IO / unstructured(visited on 03/25/2026)

work page 2026

[33] [34]

Ashish Vaswani et al.Attention Is All You Need. Aug. 2023.DOI: 10.48550/arXiv.1706.03762 . arXiv: 1706.03762 [cs].URL:http://arxiv.org/abs/1706.03762

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762 2023

[34] [35]

Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E

Mark D. Wilkinson et al. “The FAIR Guiding Principles for scientific data management and stewardship”. In:Scientific Data3.1 (Mar. 15, 2016), p. 160018.ISSN: 2052-4463.DOI: 10.1038/sdata.2016.18 .URL: https://www.nature.com/articles/sdata201618. 17 APREPRINT- MAY20, 2026

work page doi:10.1038/sdata.2016.18 2016

[35] [36]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang et al.Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. June 2025.DOI: 10.48550/arXiv.2506.05176 . arXiv: 2506.05176 [cs] .URL: http://arxiv. org/abs/2506.05176

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.05176 2025

[36] [37]

LISA: A Lithium-Ion Solid-State Assistant using large language models for knowledge defragmentation in battery science and beyond

Yinghan Zhao et al. “LISA: A Lithium-Ion Solid-State Assistant using large language models for knowledge defragmentation in battery science and beyond”. In:Materials Today Communications45 (Apr. 2025), p. 112380. ISSN: 2352-4928.DOI: 10.1016/j.mtcomm.2025.112380 .URL: https://linkinghub.elsevier.com/ retrieve/pii/S235249282500892X. Acknowledgments This wo...

work page doi:10.1016/j.mtcomm.2025.112380 2025