OpenHealth Lake: Designing and testing a data lakehouse platform for health applications

Cheryl Baxter; Danilo Silva; Joicymara Xavier; Marcel Dunaiski; Monika Moir; Tulio de Oliveira

arxiv: 2605.19922 · v1 · pith:X22ST5UInew · submitted 2026-05-19 · 💻 cs.SE · cs.DB

Danilo Silva, Monika Moir, Cheryl Baxter, Tulio de Oliveira, Joicymara Xavier, Marcel Dunaiski

Pith reviewed 2026-05-20 03:40 UTC · model grok-4.3

keywords: data lakehouse · health data management · FAIR principles · data federation · user study · collaborative research · open-source platform · bioinformatics

The pith

OpenHealth Lake prototype delivers a usable data lakehouse for health data sharing across technical skill levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenHealth Lake, a prototype data management platform for health sciences and bioinformatics that addresses the challenge of handling large heterogeneous datasets. Built on a data lakehouse architecture with data federation and FAIR principles using open-source tools, the design follows system requirements drawn from earlier studies. The platform provides access through a website, open API, and Python and R packages. A user study with participants holding different technical backgrounds showed the system is both usable and useful for supporting secure storage, exchange, and governance in collaborative global health work. The approach offers organizations a flexible template they can adapt to cloud or self-hosted setups.

Core claim

OpenHealth Lake is a data management prototype platform based on a data lakehouse architecture, data federation, and the FAIR principles. Designed using open-source tools and guided by identified system requirements, it comprises a user-friendly website, an open API, and Python and R packages. A user study confirmed its usability and usefulness for participants with varying technical backgrounds, showcasing adaptability, scalability, and reproducibility for any organization.

What carries the argument

The OpenHealth Lake prototype, which integrates a data lakehouse architecture, data federation, and the FAIR principles to support secure storage, sharing, and governance of heterogeneous health datasets through multiple access methods (website, open API, and Python and R packages).

If this is right

The platform enables efficient data exchange and governance within collaborative global health initiatives.
Users with varying technical backgrounds can interact via website, API, or Python and R packages.
Organizations can customize the system to match their specific requirements and resources including cloud or self-hosted storage.
The lakehouse design supports adaptability, scalability, and reproducibility across different health data settings.
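
As a rough illustration of how federation and the cloud-or-self-hosted choice might fit together, the sketch below registers datasets held in two separate stores under one catalog and reads them through a common interface. Every name in it (`StorageBackend`, `InMemoryBackend`, `DatasetRecord`, `FederatedCatalog`, the `example:` identifiers) is hypothetical and not taken from OpenHealth Lake; the in-memory backends merely stand in for real cloud or self-hosted storage.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple


class StorageBackend(Protocol):
    """Hypothetical interface; a cloud or self-hosted store could implement it."""

    def read(self, path: str) -> bytes: ...


class InMemoryBackend:
    """Stand-in for a real object store, used only to keep the sketch runnable."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = dict(objects)

    def read(self, path: str) -> bytes:
        return self._objects[path]


@dataclass(frozen=True)
class DatasetRecord:
    """FAIR-style metadata record (field choice is an assumption)."""

    identifier: str                  # findable: unique, stable ID
    backend: str                     # accessible: which store holds the bytes
    path: str
    media_type: str                  # interoperable: declared format
    license: str                     # reusable: explicit usage terms
    keywords: Tuple[str, ...] = ()


class FederatedCatalog:
    """One lookup point over several backends; data stays where it is stored."""

    def __init__(self) -> None:
        self._backends: Dict[str, StorageBackend] = {}
        self._records: Dict[str, DatasetRecord] = {}

    def add_backend(self, name: str, backend: StorageBackend) -> None:
        self._backends[name] = backend

    def register(self, record: DatasetRecord) -> None:
        if record.backend not in self._backends:
            raise KeyError(f"unknown backend: {record.backend}")
        self._records[record.identifier] = record

    def search(self, keyword: str) -> List[str]:
        """Return identifiers of all records tagged with the keyword."""
        return sorted(r.identifier for r in self._records.values()
                      if keyword in r.keywords)

    def fetch(self, identifier: str) -> bytes:
        """Resolve the record, then read from whichever backend holds it."""
        record = self._records[identifier]
        return self._backends[record.backend].read(record.path)


# Illustrative usage with made-up data.
catalog = FederatedCatalog()
catalog.add_backend("cloud", InMemoryBackend({"genomes/run1.fasta": b">s1\nACGT\n"}))
catalog.add_backend("onprem", InMemoryBackend({"clinic/visits.csv": b"id,age\n1,34\n"}))
catalog.register(DatasetRecord("example:genomes-run1", "cloud", "genomes/run1.fasta",
                               "text/x-fasta", "CC-BY-4.0", ("genomics",)))
catalog.register(DatasetRecord("example:clinic-visits", "onprem", "clinic/visits.csv",
                               "text/csv", "restricted", ("clinical",)))
print(catalog.search("genomics"))
print(catalog.fetch("example:clinic-visits").decode())
```

Keeping the metadata record separate from the storage backend is one way a deployment could swap cloud for self-hosted storage without changing how users find or request datasets.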

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar lakehouse patterns could be tested in other data-heavy domains such as environmental monitoring or genomics consortia.
Adding built-in analytics layers might let teams move directly from storage to insight without external tools.
Long-term monitoring of adoption rates in actual projects would reveal whether the prototype sustains use beyond initial testing.

Load-bearing premise

The system requirements identified in previously published studies and complemented by insights from the existing literature are sufficient and accurate to guide the design of a platform that meets the needs of collaborative global health initiatives.

What would settle it

A follow-up deployment in a real global health collaboration where users report persistent difficulties with data exchange or governance would show the requirements were insufficient.

Figures

Figures reproduced from arXiv: 2605.19922 by Cheryl Baxter, Danilo Silva, Joicymara Xavier, Marcel Dunaiski, Monika Moir, Tulio de Oliveira.

**Figure 2.** Simplified API data flow diagram. The diagram illustrates the operations and data flows of each core API

**Figure 3.** API infrastructure diagram. This diagram shows the OpenHealth Lake’s API design and its main services.

**Figure 4.** Total number of participants versus number of participants who skipped each task. No participants skipped

**Figure 5.** Usability responses based on user perception per task. The bubble chart illustrates the total number of

**Figure 6.** Distribution of time spent per task. Each violin shape represents the spread and density of participants’ task

**Figure 7.** Rate of correct responses per task. Tasks 0 and 1 are excluded since they did not assess critical prototype

**Figure 8.** Usefulness responses for each functionality. The usefulness of each task was rated in the final feedback form

**Figure 9.** Lakehouse application database model

**Figure 10.** User study dropout rate.

Original abstract

Data management can be a complex challenge in fields such as bioinformatics and health sciences, which continuously generate extensive heterogeneous datasets. In the context of collaborative global health initiatives, secure storage and sharing of data are crucial to support impactful research. However, the absence of a unified data management platform complicates efficient data exchange and governance within these initiatives. In this paper, we introduce the design process of OpenHealth Lake, a data management prototype platform based on a data lakehouse architecture, data federation, and the FAIR principles. The platform is designed using open-source tools, guided by system requirements identified in previously published studies and complemented by insights from the existing literature. The current prototype platform comprises a user-friendly website, an open API, Python and R packages, allowing users to interact with the platform in multiple ways. Through a user study that included participants with varying technical backgrounds, we showed that our proposed data management prototype is both usable and useful. Our prototype design showcases the adaptability, scalability, and reproducibility of a lakehouse system that can be used by any organisation. It is designed as a flexible and complementary approach that allows organisations to customise data management systems to their specific requirements and resources, including cloud-based or self-hosted storage choices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OpenHealth Lake is a concrete prototype for health data management but the user study is too lightly described to carry the main claims.

This paper walks through the design and initial testing of OpenHealth Lake, a data lakehouse prototype aimed at collaborative health and bioinformatics work. It uses established lakehouse ideas plus data federation and FAIR principles, all built from open-source pieces, and includes a website, open API, and Python/R packages for different users. The authors say the system is flexible enough for organizations to run in the cloud or self-hosted and that a user study with mixed technical backgrounds found it usable and useful.

That practical build is the clearest part of the work and gives readers something they can actually look at or adapt. The emphasis on multiple access methods and organizational flexibility is a reasonable fit for the messy reality of global health data sharing. The design draws from requirements in earlier studies plus literature, which is a standard way to start but leaves open whether those requirements fully cover current cross-border governance or privacy needs in this domain.

The evaluation section is the soft spot. The abstract and description mention a user study but supply no participant count, task details, metrics, or analysis, so it is hard to know how much the positive feedback should weigh. If the prior requirements missed key collaborative health constraints, the study would mainly validate a design that might not match the real target.

Readers working on health informatics platforms or open data tools will find the most value here as an example implementation rather than a source of new architecture. It is not a big theoretical step, but the topic is relevant and the prototype is real. I would bring it to a reading group to discuss the actual code and interfaces. It deserves peer review because the work is grounded enough to be worth referee time, though the authors should expect questions on the study design and on how the requirements were validated against fresh scenarios.

Referee Report

2 major / 1 minor

Summary. The manuscript presents the design process and prototype implementation of OpenHealth Lake, a data lakehouse platform for health applications in collaborative global health initiatives. It is constructed using open-source tools following a data lakehouse architecture with data federation and FAIR principles, guided by system requirements from prior published studies supplemented by literature insights. The prototype includes a user-friendly website, an open API, and Python and R packages. A user study with participants of varying technical backgrounds is reported to demonstrate that the platform is both usable and useful, with claims of adaptability, scalability, and reproducibility for organizational customization including cloud or self-hosted options.

Significance. If the user study and requirement validation provide robust, detailed evidence, the work could offer a practical contribution by demonstrating a flexible, open-source lakehouse approach tailored to heterogeneous health data management and governance needs. It explicitly credits the use of established open-source components and multi-language access methods, which supports reproducibility and adoption. However, the current lack of concrete evaluation details limits assessment of whether it meaningfully advances beyond existing data lakehouse applications in the health domain.

major comments (2)

User Study section: the central claim that the prototype is usable and useful rests on this study, yet no details are provided on participant count, recruitment, specific tasks or scenarios tested (e.g., data sharing workflows or privacy controls), metrics (such as SUS scores or completion rates), or any statistical analysis. This absence prevents evaluation of the evidence strength and leaves the headline result unsupported in its current form.
Design Process section: system requirements are stated to derive from previously published studies and literature, but there is no explicit mapping (e.g., table or subsection) linking each requirement to corresponding prototype features, nor any validation step testing the resulting platform against fresh domain-specific scenarios such as cross-border governance or multi-stakeholder privacy constraints. Without this, the design choices cannot be confirmed as sufficient for the targeted collaborative global health use cases.
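
To make the metrics named in the first major comment concrete, here is a minimal sketch of how System Usability Scale (SUS) scores and task completion rates are conventionally computed. It uses the standard SUS scoring rule (odd-numbered items contribute the answer minus 1, even-numbered items contribute 5 minus the answer, and the sum is multiplied by 2.5). The function names and all input values are invented for illustration and are not data from the paper's study.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one participant's ten SUS answers (each 1 to 5) on the 0-100 scale."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for index, answer in enumerate(responses):
        # Items 1, 3, 5, 7, 9 (even index) are positively worded: answer - 1.
        # Items 2, 4, 6, 8, 10 (odd index) are negatively worded: 5 - answer.
        total += (answer - 1) if index % 2 == 0 else (5 - answer)
    return total * 2.5


def completion_rate(completed: Sequence[bool]) -> float:
    """Fraction of participants who completed a given task."""
    if not completed:
        raise ValueError("no participants")
    return sum(completed) / len(completed)


# Made-up inputs, purely to show the arithmetic.
best_case = sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1])   # 100.0
neutral = sus_score([3] * 10)                            # 50.0
rate = completion_rate([True, True, False, True])        # 0.75
print(best_case, neutral, rate)
```

Reporting per-participant SUS scores alongside per-task completion rates would let readers judge the usability claim without relying on the summary alone.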

minor comments (1)

Abstract: key quantitative or qualitative outcomes from the user study (e.g., average ratings or main feedback themes) are omitted, which would strengthen the summary of results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights opportunities to strengthen the presentation of our user study and design validation. We will revise the manuscript accordingly to provide greater transparency and evidence for our claims.

Referee: User Study section: the central claim that the prototype is usable and useful rests on this study, yet no details are provided on participant count, recruitment, specific tasks or scenarios tested (e.g., data sharing workflows or privacy controls), metrics (such as SUS scores or completion rates), or any statistical analysis. This absence prevents evaluation of the evidence strength and leaves the headline result unsupported in its current form.

Authors: We agree that additional methodological details are needed to allow readers to assess the strength of the usability claims. In the revised manuscript, we will expand the User Study section to report the exact participant count, recruitment approach (via targeted invitations to health informatics and global health networks), the specific tasks and scenarios evaluated (including data ingestion, federated querying, sharing workflows, and privacy control interactions), the quantitative metrics collected (System Usability Scale scores, task completion rates, and time-on-task), and the statistical methods applied to the results. These additions will directly support the claims of usability and usefulness while maintaining the original study design.
Revision: yes
Referee: Design Process section: system requirements are stated to derive from previously published studies and literature, but there is no explicit mapping (e.g., table or subsection) linking each requirement to corresponding prototype features, nor any validation step testing the resulting platform against fresh domain-specific scenarios such as cross-border governance or multi-stakeholder privacy constraints. Without this, the design choices cannot be confirmed as sufficient for the targeted collaborative global health use cases.

Authors: We accept that an explicit traceability between requirements and features would improve the rigor of the design narrative. The revised Design Process section will include a new table (or dedicated subsection) that maps each stated requirement to the corresponding OpenHealth Lake features (e.g., data federation for cross-border access, FAIR-compliant metadata for governance). We will also add a short validation discussion that illustrates how the implemented architecture addresses representative scenarios such as multi-stakeholder privacy constraints and cross-border data governance, leveraging the existing data lakehouse and federation components. This will demonstrate sufficiency for the intended use cases without altering the core design.
Revision: yes
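
The requirement-to-feature mapping discussed in this exchange could also be kept in machine-checkable form. The sketch below is one possible shape, not the authors' table: the requirement IDs, their wording, and the function names are assumptions, and only the two feature pairings echo the examples given in the response above.

```python
from typing import Dict, List

# Hypothetical requirement IDs and descriptions, invented for this sketch.
REQUIREMENTS: Dict[str, str] = {
    "R1": "cross-border data access",
    "R2": "governance through rich metadata",
    "R3": "multi-stakeholder privacy constraints",
}

# Feature -> requirements it is claimed to address (illustrative pairings).
FEATURE_MAP: Dict[str, List[str]] = {
    "data federation": ["R1"],
    "FAIR-compliant metadata": ["R2"],
}


def coverage(requirements: Dict[str, str],
             feature_map: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert the map: requirement ID -> features that address it."""
    result: Dict[str, List[str]] = {req: [] for req in requirements}
    for feature, reqs in feature_map.items():
        for req in reqs:
            if req not in result:
                raise KeyError(f"{feature!r} cites unknown requirement {req!r}")
            result[req].append(feature)
    return result


def uncovered(requirements: Dict[str, str],
              feature_map: Dict[str, List[str]]) -> List[str]:
    """Requirement IDs that no feature claims to address."""
    return sorted(req for req, features in coverage(requirements, feature_map).items()
                  if not features)


gaps = uncovered(REQUIREMENTS, FEATURE_MAP)
print(gaps)  # ['R3'] for the made-up data above
```

A check like this makes gaps visible early: any requirement without a linked feature is exactly the kind of item a validation scenario would need to probe.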

Circularity Check

0 steps flagged

No significant circularity; evaluation rests on independent user study

The paper describes a design process for OpenHealth Lake guided by requirements drawn from previously published studies plus literature, then evaluates the resulting prototype via a separate user study with participants of varying technical backgrounds. The central claim (usability and usefulness) is supported by this external user-study feedback rather than by any self-referential fitting, redefinition, or reduction of the evaluation metrics to the design inputs. No equations, fitted parameters, or uniqueness theorems appear; self-citations of prior requirements work do not bear the load of the reported results, which remain independently testable. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on domain assumptions about data heterogeneity and governance needs in health research, with no free parameters or invented entities explicitly introduced in the abstract.

axioms (2)

domain assumption: Data management can be a complex challenge in fields such as bioinformatics and health sciences, which continuously generate extensive heterogeneous datasets.
Source: opening statement of the abstract that motivates the entire platform design.
domain assumption: The absence of a unified data management platform complicates efficient data exchange and governance within collaborative global health initiatives.
Source: core problem statement used to justify the need for OpenHealth Lake.
