
arxiv: 2605.19922 · v1 · pith:X22ST5UInew · submitted 2026-05-19 · 💻 cs.SE · cs.DB

OpenHealth Lake: Designing and testing a data lakehouse platform for health applications

Pith reviewed 2026-05-20 03:40 UTC · model grok-4.3

keywords data lakehouse · health data management · FAIR principles · data federation · user study · collaborative research · open-source platform · bioinformatics

The pith

OpenHealth Lake prototype delivers a usable data lakehouse for health data sharing across technical skill levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenHealth Lake, a prototype data management platform for health sciences and bioinformatics that addresses the challenge of handling large heterogeneous datasets. The platform is built with open-source tools on a data lakehouse architecture, combined with data federation and the FAIR principles, and its design follows system requirements drawn from earlier studies. It provides access through a website, an open API, and Python and R packages. A user study with participants of different technical backgrounds showed the system is both usable and useful for supporting secure storage, exchange, and governance in collaborative global health work. The approach offers organizations a flexible template they can adapt to cloud or self-hosted setups.

Core claim

OpenHealth Lake is a data management prototype platform based on a data lakehouse architecture, data federation, and the FAIR principles. Designed using open-source tools and guided by identified system requirements, it comprises a user-friendly website, an open API, and Python and R packages. A user study confirmed its usability and usefulness for participants with varying technical backgrounds, showcasing adaptability, scalability, and reproducibility for any organization.

What carries the argument

OpenHealth Lake prototype, which integrates data lakehouse architecture, data federation, and FAIR principles to support secure storage, sharing, and governance of heterogeneous health datasets through multiple access methods.

If this is right

  • The platform enables efficient data exchange and governance within collaborative global health initiatives.
  • Users with varying technical backgrounds can interact via website, API, or Python and R packages.
  • Organizations can customize the system to match their specific requirements and resources including cloud or self-hosted storage.
  • The lakehouse design supports adaptability, scalability, and reproducibility across different health data settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar lakehouse patterns could be tested in other data-heavy domains such as environmental monitoring or genomics consortia.
  • Adding built-in analytics layers might let teams move directly from storage to insight without external tools.
  • Long-term monitoring of adoption rates in actual projects would reveal whether the prototype sustains use beyond initial testing.

Load-bearing premise

The system requirements identified in previously published studies and complemented by insights from the existing literature are sufficient and accurate to guide the design of a platform that meets the needs of collaborative global health initiatives.

What would settle it

A follow-up deployment in a real global health collaboration where users report persistent difficulties with data exchange or governance would show the requirements were insufficient.

Figures

Figures reproduced from arXiv: 2605.19922 by Cheryl Baxter, Danilo Silva, Joicymara Xavier, Marcel Dunaiski, Monika Moir, Tulio de Oliveira.

Figure 1: Lakehouse Infrastructure design. The system comprises three main modules: the application database, …
Figure 2: Simplified API data flow diagram. The diagram illustrates the operations and data flows of each core API …
Figure 3: API infrastructure diagram. This diagram shows the OpenHealth Lake’s API design and its main services.
Figure 4: Total number of participants versus number of participants who skipped each task. No participants skipped …
Figure 5: Usability responses based on user perception per task. The bubble chart illustrates the total number of …
Figure 6: Distribution of time spent per task. Each violin shape represents the spread and density of participants’ task …
Figure 7: Rate of correct responses per task. Tasks 0 and 1 are excluded since they did not assess critical prototype …
Figure 8: Usefulness responses for each functionality. The usefulness of each task was rated in the final feedback form …
Figure 9: Lakehouse application database model
Figure 10: User study dropout rate.
Original abstract

Data management can be a complex challenge in fields such as bioinformatics and health sciences, which continuously generate extensive heterogeneous datasets. In the context of collaborative global health initiatives, secure storage and sharing of data are crucial to support impactful research. However, the absence of a unified data management platform complicates efficient data exchange and governance within these initiatives. In this paper, we introduce the design process of OpenHealth Lake, a data management prototype platform based on a data lakehouse architecture, data federation, and the FAIR principles. The platform is designed using open-source tools, guided by system requirements identified in previously published studies and complemented by insights from the existing literature. The current prototype platform comprises a user-friendly website, an open API, Python and R packages, allowing users to interact with the platform in multiple ways. Through a user study that included participants with varying technical backgrounds, we showed that our proposed data management prototype is both usable and useful. Our prototype design showcases the adaptability, scalability, and reproducibility of a lakehouse system that can be used by any organisation. It is designed as a flexible and complementary approach that allows organisations to customise data management systems to their specific requirements and resources, including cloud-based or self-hosted storage choices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents the design process and prototype implementation of OpenHealth Lake, a data lakehouse platform for health applications in collaborative global health initiatives. It is constructed using open-source tools following a data lakehouse architecture with data federation and FAIR principles, guided by system requirements from prior published studies supplemented by literature insights. The prototype includes a user-friendly website, an open API, and Python and R packages. A user study with participants of varying technical backgrounds is reported to demonstrate that the platform is both usable and useful, with claims of adaptability, scalability, and reproducibility for organizational customization including cloud or self-hosted options.

Significance. If the user study and requirement validation provide robust, detailed evidence, the work could offer a practical contribution by demonstrating a flexible, open-source lakehouse approach tailored to heterogeneous health data management and governance needs. It explicitly credits the use of established open-source components and multi-language access methods, which supports reproducibility and adoption. However, the current lack of concrete evaluation details limits assessment of whether it meaningfully advances beyond existing data lakehouse applications in the health domain.

major comments (2)
  1. User Study section: the central claim that the prototype is usable and useful rests on this study, yet no details are provided on participant count, recruitment, specific tasks or scenarios tested (e.g., data sharing workflows or privacy controls), metrics (such as SUS scores or completion rates), or any statistical analysis. This absence prevents evaluation of the evidence strength and leaves the headline result unsupported in its current form.
  2. Design Process section: system requirements are stated to derive from previously published studies and literature, but there is no explicit mapping (e.g., table or subsection) linking each requirement to corresponding prototype features, nor any validation step testing the resulting platform against fresh domain-specific scenarios such as cross-border governance or multi-stakeholder privacy constraints. Without this, the design choices cannot be confirmed as sufficient for the targeted collaborative global health use cases.
minor comments (1)
  1. Abstract: key quantitative or qualitative outcomes from the user study (e.g., average ratings or main feedback themes) are omitted, which would strengthen the summary of results.
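The metrics named in major comment 1 (SUS scores and completion rates) follow standard definitions, so a minimal sketch may help show what the referee is asking the authors to report. It assumes the standard 10-item System Usability Scale with 1 to 5 responses; the function names and sample data are hypothetical and are not taken from the paper or its user study.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one participant's 10-item SUS questionnaire on a 0-100 scale.

    Standard SUS scoring: odd-numbered items contribute (response - 1),
    even-numbered items contribute (5 - response), and the sum is
    multiplied by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(not 1 <= r <= 5 for r in responses):
        raise ValueError("each SUS response must be between 1 and 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # positively worded item
        else:
            total += 5 - response   # negatively worded item
    return total * 2.5


def completion_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of participants who completed a task (True = completed)."""
    if not outcomes:
        raise ValueError("no outcomes recorded for this task")
    return sum(outcomes) / len(outcomes)


# Hypothetical data, for illustration only.
neutral_participant = [3] * 10
print(sus_score(neutral_participant))
print(completion_rate([True, True, False, True]))
```

Reporting the per-participant SUS distribution alongside per-task completion rates would let readers judge the usability claim directly.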

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights opportunities to strengthen the presentation of our user study and design validation. We will revise the manuscript accordingly to provide greater transparency and evidence for our claims.

Point-by-point responses
  1. Referee: User Study section: the central claim that the prototype is usable and useful rests on this study, yet no details are provided on participant count, recruitment, specific tasks or scenarios tested (e.g., data sharing workflows or privacy controls), metrics (such as SUS scores or completion rates), or any statistical analysis. This absence prevents evaluation of the evidence strength and leaves the headline result unsupported in its current form.

    Authors: We agree that additional methodological details are needed to allow readers to assess the strength of the usability claims. In the revised manuscript, we will expand the User Study section to report the exact participant count, recruitment approach (via targeted invitations to health informatics and global health networks), the specific tasks and scenarios evaluated (including data ingestion, federated querying, sharing workflows, and privacy control interactions), the quantitative metrics collected (System Usability Scale scores, task completion rates, and time-on-task), and the statistical methods applied to the results. These additions will directly support the claims of usability and usefulness while maintaining the original study design. revision: yes

  2. Referee: Design Process section: system requirements are stated to derive from previously published studies and literature, but there is no explicit mapping (e.g., table or subsection) linking each requirement to corresponding prototype features, nor any validation step testing the resulting platform against fresh domain-specific scenarios such as cross-border governance or multi-stakeholder privacy constraints. Without this, the design choices cannot be confirmed as sufficient for the targeted collaborative global health use cases.

    Authors: We accept that an explicit traceability between requirements and features would improve the rigor of the design narrative. The revised Design Process section will include a new table (or dedicated subsection) that maps each stated requirement to the corresponding OpenHealth Lake features (e.g., data federation for cross-border access, FAIR-compliant metadata for governance). We will also add a short validation discussion that illustrates how the implemented architecture addresses representative scenarios such as multi-stakeholder privacy constraints and cross-border data governance, leveraging the existing data lakehouse and federation components. This will demonstrate sufficiency for the intended use cases without altering the core design. revision: yes
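The requirement-to-feature traceability promised in the second response can be checked mechanically once the mapping exists. The sketch below is one possible shape for such a check, not the authors' method. The first two pairings reuse the examples given in the response, while the third requirement and its missing mapping are hypothetical and included only to show how a gap would surface; they say nothing about the actual platform.

```python
from typing import Dict, List


def unmapped_requirements(
    requirements: List[str], trace: Dict[str, List[str]]
) -> List[str]:
    """Return the requirements that have no feature listed against them."""
    return [req for req in requirements if not trace.get(req)]


# Requirement labels are illustrative, not the paper's requirement list.
requirements = [
    "cross-border access",
    "governance",
    "multi-stakeholder privacy",  # hypothetical entry, left unmapped on purpose
]

# Pairings taken from the examples in the authors' response.
trace = {
    "cross-border access": ["data federation"],
    "governance": ["FAIR-compliant metadata"],
}

gaps = unmapped_requirements(requirements, trace)
print(gaps)
```

Any requirement returned by the check is one that the traceability table would still need to address.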

Circularity Check

0 steps flagged

No significant circularity; evaluation rests on independent user study

Full rationale

The paper describes a design process for OpenHealth Lake guided by requirements drawn from previously published studies plus literature, then evaluates the resulting prototype via a separate user study with participants of varying technical backgrounds. The central claim (usability and usefulness) is supported by this external user-study feedback rather than by any self-referential fitting, redefinition, or reduction of the evaluation metrics to the design inputs. No equations, fitted parameters, or uniqueness theorems appear; self-citations of prior requirements work do not bear the load of the reported results, which remain independently testable. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on domain assumptions about data heterogeneity and governance needs in health research, with no free parameters or invented entities explicitly introduced in the abstract.

axioms (2)
  • domain assumption Data management can be a complex challenge in fields such as bioinformatics and health sciences, which continuously generate extensive heterogeneous datasets.
    Opening statement of the abstract that motivates the entire platform design.
  • domain assumption The absence of a unified data management platform complicates efficient data exchange and governance within collaborative global health initiatives.
    Core problem statement used to justify the need for OpenHealth Lake.

pith-pipeline@v0.9.0 · 5760 in / 1258 out tokens · 33220 ms · 2026-05-20T03:40:50.105709+00:00 · methodology

