BDI-Kit Demo: A Toolkit for Programmable and Conversational Data Harmonization

Christos Koutras; Juliana Freire; Roque Lopez; Yurong Liu

arxiv: 2604.06405 · v1 · submitted 2026-04-07 · 💻 cs.AI · cs.DB

Roque Lopez, Yurong Liu, Christos Koutras, Juliana Freire

Pith reviewed 2026-05-10 18:30 UTC · model grok-4.3

keywords: data harmonization, schema matching, value matching, Python API, conversational interface, data integration, AI-assisted refinement

The pith

BDI-Kit supplies a Python API for building data harmonization pipelines and an AI chat interface for refining schema and value matches through conversation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Data harmonization is slowed by mismatched schemas, value formats, and domain conventions that block integrative analysis. The paper presents BDI-Kit as an extensible toolkit that supplies schema and value matching through two interfaces. Developers can compose matching primitives into pipelines using the Python API, inspect intermediate results, and reuse transformations. Domain experts can instead describe tasks in natural language to the AI assistant, receive suggestions, and iteratively adjust outputs. If these interfaces work together as described, harmonization becomes accessible to both coders and non-coders without requiring extensive custom code for each new dataset.

Core claim

BDI-Kit is an extensible toolkit for schema and value matching that provides a Python API for developers to programmatically compose primitives, examine outputs, and reuse transformations, together with an AI-assisted chat interface that lets domain experts direct the same capabilities through natural-language dialogue and user-driven refinement of matches.

What carries the argument

Dual complementary interfaces for schema and value matching: a Python API that exposes composable primitives and an AI chat layer that supports iterative natural-language guidance and refinement.
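To make the idea of composable matching primitives concrete, here is a minimal sketch in plain Python. It is not BDI-Kit's API: `best_matches` and `apply_overrides` are hypothetical stand-ins built on the standard library's `difflib`, and the column names, values, and the 0.5 threshold are made-up examples. It only illustrates the pattern of automated matching, inspection of intermediate results, and user-driven refinement.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(source_items, target_items, threshold=0.5):
    """Hypothetical primitive: pair each source item with its most similar target.

    Items whose best score falls below the threshold are left unmatched so
    that a person can decide what to do with them.
    """
    matches = {}
    for src in source_items:
        best = max(target_items, key=lambda tgt: similarity(src, tgt))
        score = round(similarity(src, best), 2)
        if score >= threshold:
            matches[src] = (best, score)
    return matches


def apply_overrides(matches, overrides):
    """User-driven refinement: manual decisions replace automated ones."""
    refined = dict(matches)
    for src, tgt in overrides.items():
        refined[src] = (tgt, 1.0)
    return refined


# Step 1: automated schema matching on toy column names.
source_columns = ["Age", "Tumor_Stage", "Gender"]
target_columns = ["age", "tumor_stage", "sex"]
schema_mapping = best_matches(source_columns, target_columns)

# Step 2: inspect the intermediate result; "Gender" has no confident match.
unmatched = [col for col in source_columns if col not in schema_mapping]

# Step 3: the user supplies the missing match by hand.
refined_mapping = apply_overrides(schema_mapping, {"Gender": "sex"})

# Step 4: the same primitive is reused for value matching within a column.
value_mapping = best_matches(["MALE", "FEMALE"], ["male", "female"])
```

String similarity alone cannot link "Gender" to "sex", which is the kind of gap that the paper's refinement step, whether in code or in conversation, is meant to close.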

If this is right

Developers can build, inspect, and reuse harmonization steps without rewriting matching logic for each new source.
Domain experts can guide matching and correction through ordinary conversation rather than code.
The toolkit supports iterative validation where automated results are examined and adjusted in either interface.
Transformations created in one session can be saved and applied to additional datasets.
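The last point, saving a transformation and applying it to another dataset, can be sketched as follows. The JSON layout, the helper names (`save_mapping`, `load_mapping`, `apply_mapping`), and the sample records are assumptions made for illustration; the paper does not specify how BDI-Kit stores or replays mappings.

```python
import json
import tempfile
from pathlib import Path


def save_mapping(mapping: dict, path: Path) -> None:
    """Persist a column and value mapping as JSON."""
    path.write_text(json.dumps(mapping, indent=2))


def load_mapping(path: Path) -> dict:
    """Load a mapping that was saved in an earlier session."""
    return json.loads(path.read_text())


def apply_mapping(rows, mapping):
    """Rename columns and translate values according to the mapping."""
    harmonized = []
    for row in rows:
        out = {}
        for column, value in row.items():
            spec = mapping.get(column)
            if spec is None:
                continue  # unmapped columns are dropped in this sketch
            value_map = spec.get("values", {})
            out[spec["target"]] = value_map.get(value, value)
        harmonized.append(out)
    return harmonized


# Hypothetical mapping produced and validated in an earlier session.
mapping = {
    "Gender": {"target": "sex", "values": {"M": "male", "F": "female"}},
    "Age": {"target": "age"},
}

with tempfile.TemporaryDirectory() as tmp:
    mapping_path = Path(tmp) / "mapping.json"
    save_mapping(mapping, mapping_path)
    reloaded = load_mapping(mapping_path)

# A second dataset that shares the source schema.
new_dataset = [{"Gender": "F", "Age": 61, "Notes": "n/a"}]
result = apply_mapping(new_dataset, reloaded)
```

Keeping the mapping as data rather than as code is one way to make it inspectable and reusable; whether BDI-Kit takes this approach is not stated in the text above.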

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the chat interface scales, non-technical analysts could handle routine data integration tasks that currently require data engineers.
The same dual-interface pattern could be applied to related data tasks such as entity resolution or format conversion.
Integration with existing data catalogs or warehouses might allow the toolkit to suggest matches based on prior harmonizations.

Load-bearing premise

That automated matching plus AI suggestions plus user refinement will reliably produce accurate schema and value alignments across varied real-world datasets with only modest manual effort.

What would settle it

Apply the toolkit to a collection of heterogeneous datasets from different domains, then measure the number of manual corrections or domain-specific rules still required to reach high match accuracy.
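One way to run such a measurement is to compare the automated mapping with a hand-built gold mapping and count what a person would still have to fix. The sketch below assumes a hypothetical `evaluate_mapping` helper and toy mappings; none of the numbers come from the paper, which reports no evaluation.

```python
def evaluate_mapping(predicted: dict, gold: dict) -> dict:
    """Compare a predicted source-to-target mapping with a gold mapping.

    manual_corrections counts predictions that must be changed or removed
    plus gold matches that were never proposed.
    """
    correct = sum(1 for src, tgt in predicted.items() if gold.get(src) == tgt)
    wrong = len(predicted) - correct
    missing = sum(1 for src in gold if src not in predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "manual_corrections": wrong + missing,
    }


# Toy example: one wrong match ("Gender") and one missing match ("Race").
predicted = {"Age": "age", "Gender": "age", "Stage": "tumor_stage"}
gold = {"Age": "age", "Gender": "sex", "Stage": "tumor_stage", "Race": "race"}

report = evaluate_mapping(predicted, gold)
```

Running the same function per dataset and per domain would show whether the correction count stays modest as heterogeneity grows.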

Figures

Figures reproduced from arXiv: 2604.06405 by Christos Koutras, Juliana Freire, Roque Lopez, Yurong Liu.

**Figure 2.** Results of calling the match_schema() and match_values() functions via the Python API.

… natural-language requests, selects appropriate primitives, and executes them on behalf of the user. Importantly, the agent does not replace BDI-Kit’s logic; instead, it acts as an orchestration and explanation layer. To ensure safe and reliable harmonization, the system enforces guardrails: all automated suggestions report…

**Figure 3.** BDI-Kit accessed through an AI agent orchestrating schema and value matching.

Original abstract

Data harmonization remains a major bottleneck for integrative analysis due to heterogeneity in schemas, value representations, and domain-specific conventions. BDI-Kit provides an extensible toolkit for schema and value matching. It exposes two complementary interfaces tailored to different user needs: a Python API enabling developers to construct harmonization pipelines programmatically, and an AI-assisted chat interface allowing domain experts to harmonize data through natural language dialogue. This demonstration showcases how users interact with BDI-Kit to iteratively explore, validate, and refine schema and value matches through a combination of automated matching, AI-assisted reasoning, and user-driven refinement. We present two scenarios: (i) using the Python API to programmatically compose primitives, examine intermediate outputs, and reuse transformations; and (ii) conversing with the AI assistant in natural language to access BDI-Kit's capabilities and iteratively refine outputs based on the assistant's suggestions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

BDI-Kit is a clean demo of a data harmonization toolkit with Python and chat interfaces, but it supplies no evidence that the combination improves results or reduces effort.

The paper describes BDI-Kit, a toolkit for schema and value matching that gives users two ways in: a Python API for building and inspecting pipelines, and a conversational AI layer for natural-language interaction and refinement. The core offering is the pairing of these interfaces around standard matching primitives plus user-driven iteration. That pairing is the only real engineering move here, and the two scenarios lay out how a developer would compose steps in code versus how a domain expert would steer the process through dialogue. The descriptions are straightforward and make the intended workflow easy to follow. The paper does a reasonable job showing how automated matching, AI suggestions, and manual tweaks are meant to fit together in practice.

Beyond that, there is little new. No new matching algorithms appear, and the work stays at the level of packaging existing techniques. The main shortcoming is the total absence of evaluation. The manuscript offers no accuracy figures, no timing data, no user studies, and no head-to-head comparisons against other harmonization tools or libraries. Without those numbers, any claim that the dual interface actually helps with real heterogeneity rests on the demo examples alone. The assumption that the AI-assisted path will cut manual work without extra tuning is left unexamined.

This is a tool paper aimed at practitioners who already do data integration and want a ready-made set of interfaces. Readers hunting for methodological advances or reproducible results will find the contribution thin. I would not bring it to a research-focused reading group. It might warrant a quick look if you are implementing something similar, but I would not cite the work itself. Skip peer review for a standard research track; this belongs at most in a demo or tool session if the venue has one.

Referee Report

0 major / 2 minor

Summary. The paper describes BDI-Kit, an extensible toolkit for schema and value matching to address data harmonization bottlenecks caused by heterogeneous schemas, value representations, and domain conventions. It exposes a Python API for developers to programmatically compose harmonization pipelines and primitives, along with an AI-assisted chat interface enabling domain experts to perform harmonization through natural language dialogue. The demonstration presents two usage scenarios: (i) programmatic composition, inspection of intermediate outputs, and reuse of transformations via the Python API; and (ii) iterative exploration, validation, and refinement of matches via conversational interaction with the AI assistant.

Significance. If implemented as described, BDI-Kit provides a practical dual-interface approach to data harmonization that combines automated matching with user-driven refinement, addressing a common bottleneck in integrative data analysis. The separation of a developer-focused Python API and an expert-facing conversational interface is a clear strength, allowing both technical pipeline construction and accessible natural-language interaction. As a software demonstration paper, the contribution is primarily in the description of the system's design and interaction patterns rather than in empirical performance results.

minor comments (2)

The abstract and introduction would benefit from a brief statement clarifying that the paper is a demonstration of interfaces and scenarios rather than a quantitative evaluation of matching accuracy or effort reduction.
Figure or code examples illustrating the Python API primitives and the chat interface dialogue flow would improve clarity for readers attempting to understand the composition of pipelines.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of the manuscript and for recommending acceptance. The referee's summary correctly identifies the core contribution of BDI-Kit as a dual-interface toolkit combining a programmable Python API with an AI-assisted conversational interface for data harmonization.

Circularity Check

0 steps flagged

No significant circularity

The paper is a purely descriptive software demonstration of BDI-Kit, detailing its Python API for composing harmonization primitives and an AI chat interface for natural-language interaction. It presents usage scenarios but contains no equations, derivations, fitted parameters, predictions, or load-bearing self-citations. All claims are direct descriptions of implemented interfaces and observed interactions, with no reduction of any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no free parameters, mathematical axioms, or invented entities; it is a description of an implemented software toolkit for data harmonization.

pith-pipeline@v0.9.0 · 5452 in / 1054 out tokens · 53542 ms · 2026-05-10T18:30:56.674576+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1] Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova. 2009. Data Integration for the Relational Web. Proceedings of the VLDB Endowment (PVLDB) 2, 1 (2009), 1090–1101.

[2] Liwei Cao, Chen Huang, Daniel Cui Zhou, Yingwei Hu, Mamie Lih, Sara Savage, Karsten Krug, David Clark, et al. 2021. Proteogenomic Characterization of Pancreatic Ductal Adenocarcinoma. Cell 184, 19 (2021), 5031–5052.

[3] AnHai Doan, Alon Halevy, and Zachary Ives. 2012. Principles of Data Integration (1st ed.). Morgan Kaufmann Publishers Inc.

[4] Yongchao Dou, Lizabeth Katsnelson, Marina Gritsenko, Yingwei Hu, Boris Reva, Runyu Hong, Yi-Ting Wang, et al. 2023. Proteogenomic Insights Suggest Druggable Pathways in Endometrial Carcinoma. Cancer Cell 41, 9 (2023), 1586–1605.

[5] Yongchao Dou, Emily Kawaler, Daniel Cui Zhou, Marina Gritsenko, Chen Huang, Lili Blumenberg, Alla Karpova, Vladislav Petyuk, et al. 2020. Proteogenomic Characterization of Endometrial Carcinoma. Cell 180, 4 (2020), 729–748.

[6] Christos Koutras, George Siachamis, Andra Ionescu, Kyriakos Psarakis, Jerry Brons, Marios Fragkoulis, Christoph Lofi, Angela Bonifati, and Asterios Katsifodimos. 2021. Valentine: Evaluating Matching Techniques for Dataset Discovery. In Proceedings of International Conference on Data Engineering (ICDE). 468–479.

[7] Yize Li, Yongchao Dou, Felipe Da Veiga Leprevost, Yifat Geffen, Anna Calinawan, François Aguet, Yo Akiyama, et al. 2023. Proteogenomic Data and Resources for Pan-cancer Analysis. Cancer Cell 41, 8 (2023), 1397–1406.

[8] Yurong Liu, Eduardo H. M. Pena, Aécio Santos, Eden Wu, and Juliana Freire. 2025. Magneto: Combining Small and Large Language Models for Schema Matching. Proceedings of the VLDB Endowment (PVLDB) 18, 8 (2025), 2681–2694.

[9] Roque Lopez, Aécio Santos, Christos Koutras, and Juliana Freire. 2026. BDI-Kit: An AI-Powered Toolkit for Biomedical Data Harmonization. Patterns 7 (2026).

[10] Renée Miller. 2018. Open Data Integration. Proceedings of the VLDB Endowment (PVLDB) 11, 12 (2018), 2130–2139.
