arxiv: 2604.21449 · v1 · submitted 2026-04-23 · 💻 cs.DC · cs.DB

Research on the efficiency of data loading and storage in Data Lakehouse architectures for the formation of analytical data systems

Pith reviewed 2026-05-08 14:02 UTC · model grok-4.3

classification 💻 cs.DC cs.DB
keywords Data Lakehouse · Delta Lake · Apache Iceberg · Apache Hudi · data loading time · storage efficiency · Apache Spark · ETL processes

The pith

Delta Lake loads data fastest while Apache Iceberg uses the least disk space among three common Lakehouse systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests Apache Hudi, Apache Iceberg, and Delta Lake on loading and storing CSV and JSON files up to 7 GB using Apache Spark. It builds four sequential ETL steps for each system and measures two outcomes: time to complete the loads and the final size of the tables on disk. A reader cares because these metrics directly affect how fast and how cheaply teams can turn raw data into usable tables for analysis. The results separate the systems by priority: Delta Lake wins on speed for any size tested, Iceberg wins on smaller table footprints and stability, and Hudi trails on both measures for this batch workload.

Core claim

Experiments with structured and semi-structured data show Delta Lake completing loads in the shortest time regardless of volume, while Apache Iceberg consistently produces the smallest tables on disk and maintains stable behavior. Apache Hudi records longer load times and larger storage footprints in the same tasks. The study concludes that Delta Lake is the preferred architecture when loading speed is the main requirement, and Apache Iceberg is preferred when disk space savings and stability matter most.

What carries the argument

Four sequential ETL processes that read, transform, and write data into each Lakehouse table format, evaluated by load completion time and resulting table size in the file system.
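The two criteria named here, load completion time and on-disk table size, can be captured by a small harness. The following is a minimal sketch, not the authors' code: `measure_load`, `dir_size_bytes`, and the stand-in writer `write_csv_table` are hypothetical names, and the writer emits a plain CSV file rather than a real Hudi, Iceberg, or Delta Lake table. In an actual run the callable would presumably wrap a Spark write into the target table format.

```python
import csv
import os
import tempfile
import time
from typing import Callable, Tuple


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under `path` (data plus metadata)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def measure_load(write_table: Callable[[str], None], table_dir: str) -> Tuple[float, int]:
    """Run one load and return (elapsed seconds, resulting table size in bytes)."""
    start = time.perf_counter()
    write_table(table_dir)
    elapsed = time.perf_counter() - start
    return elapsed, dir_size_bytes(table_dir)


def write_csv_table(table_dir: str) -> None:
    """Hypothetical stand-in for a Spark write into a Lakehouse table."""
    os.makedirs(table_dir, exist_ok=True)
    with open(os.path.join(table_dir, "part-0000.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["city", "value"])
        for i in range(1000):
            writer.writerow([f"city_{i}", i])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        seconds, size = measure_load(write_csv_table, os.path.join(tmp, "demo_table"))
        print(f"load time: {seconds:.4f} s, table size: {size} bytes")
```

Measuring the whole table directory, rather than only the data files, means transaction logs and metadata written by a table format would count toward its footprint. Whether the paper measured size this way is not stated in the material above.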

Load-bearing premise

Performance measured on files up to 7 GB and four fixed ETL steps is enough to decide which architecture is optimal for analytical data systems in general.

What would settle it

Repeating the loads on a dataset larger than 7 GB or on a production-scale Spark cluster where Iceberg finishes faster than Delta Lake would show the speed ranking does not hold.

Figures

Figures reproduced from arXiv: 2604.21449 by Halyna Osukhivska, Ivan Borodii.

Figure 1: General structure diagram of data loading process in Data Lakehouse systems.

Apache Iceberg showed high performance, especially when working with smaller and medium-sized datasets. For the air_quality_level and world_cities tables, loading times were 10.7 s and 6.16 s, respectively, differing only slightly from Delta Lake's performance (by 11.8% and 4.3%, respectively). In the case of the large weather…

Figure 2: Comparative review of memory and performance for each Lakehouse.
Original abstract

The paper presents a study of the efficiency of loading and storing data in the three most common Data Lakehouse systems, including Apache Hudi, Apache Iceberg, and Delta Lake, using Apache Spark as a distributed data processing platform. The study analyzes the behavior of each system when processing structured (CSV) and semi-structured (JSON) data of different sizes, including loading files up to 7 GB in size. The purpose of the work is to determine the most optimal Data Lakehouse architecture based on the type and volume of data sources, data loading performance using Apache Spark, and disk size of data for forming analytical data systems. The research covers the development of four sequential ETL processes, which include reading, transforming, and loading data into tables in each of the Data Lakehouse systems. The efficiency of each Lakehouse was evaluated according to two key criteria: data loading time and the volume of tables formed in the file system. For the first time, a comparison of performance and data storage in Apache Iceberg, Apache Hudi, and Delta Lake Data Lakehouse systems was conducted to select the most relevant architecture for building analytical data systems. The practical value of the study consists in the fact that it assists data engineers and architects in choosing the most appropriate Lakehouse architecture, understanding the balance between loading performance and storage efficiency. Experimental results showed that Delta Lake is the most optimal architecture for systems where the priority is the speed of loading data of any volume, while Apache Iceberg is most appropriate for systems where stability and disk space savings are critical. Apache Hudi proved ineffective in data loading and storage evaluation tasks but could potentially be effective in incremental update and streaming processing scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript reports an experimental study comparing the data loading performance and storage efficiency of Apache Hudi, Apache Iceberg, and Delta Lake using Apache Spark for CSV and JSON datasets up to 7 GB. Through four sequential ETL processes, it measures loading times and final table sizes, concluding that Delta Lake offers the best loading speed for data of any volume, Iceberg provides superior stability and disk space savings, and Hudi is less effective for these tasks but may suit incremental updates.

Significance. If the reported performance differences hold under broader conditions, the work could assist practitioners in selecting appropriate Data Lakehouse architectures for analytical systems by balancing loading speed against storage efficiency. The empirical data on small-scale workloads adds to the limited body of comparative studies in this area.

major comments (3)
  1. [Abstract] The claim in the abstract that Delta Lake is the most optimal architecture for the speed of loading data of any volume is not supported by the experiments, which are limited to data sizes up to 7 GB with only four sequential ETL processes and no larger-scale runs, concurrent writers, update/merge workloads, or failure-injection tests.
  2. [Results] Stability is asserted as a key advantage for Apache Iceberg without an operational definition or any quantitative measurement; the evaluation criteria are restricted to loading time and final table volume, leaving the stability claim unsupported.
  3. [Experimental Setup] No details are provided on hardware specifications, number of runs, error bars, or statistical significance, which undermines assessment of whether the measured differences reliably support the optimality rankings.
minor comments (1)
  1. [Abstract and Introduction] The abstract and introduction repeat the study purpose and evaluation criteria multiple times; condensing this would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive referee report. We address each major comment below, indicating planned revisions to align claims with experimental scope and strengthen the manuscript.

point-by-point responses
  1. Referee: [Abstract] The claim in the abstract that Delta Lake is the most optimal architecture for the speed of loading data of any volume is not supported by the experiments, which are limited to data sizes up to 7 GB with only four sequential ETL processes and no larger-scale runs, concurrent writers, update/merge workloads, or failure-injection tests.

    Authors: We agree that the experiments cover only datasets up to 7 GB using four sequential ETL processes and do not include larger scales, concurrent writers, updates, merges, or failure tests. The phrasing 'data of any volume' extrapolates beyond the tested conditions. In revision, we will update the abstract to state that Delta Lake showed the fastest loading for the evaluated volumes up to 7 GB, and we will add an explicit limitations paragraph in the discussion section noting the restricted scope and the value of future larger-scale validation.
    revision: yes

  2. Referee: [Results] Stability is asserted as a key advantage for Apache Iceberg without an operational definition or any quantitative measurement; the evaluation criteria are restricted to loading time and final table volume, leaving the stability claim unsupported.

    Authors: The observation is accurate: our evaluation criteria were limited to loading time and final table size, with no operational definition or quantitative metrics for stability. References to 'superior stability' for Iceberg were informal and not data-driven. We will remove all claims about stability advantages from the abstract, results, and conclusions, restricting statements to the two measured criteria only.
    revision: yes

  3. Referee: [Experimental Setup] No details are provided on hardware specifications, number of runs, error bars, or statistical significance, which undermines assessment of whether the measured differences reliably support the optimality rankings.

    Authors: We will revise the Experimental Setup section to specify the hardware configuration (Spark cluster CPU, memory, and storage details), the number of runs executed per configuration, inclusion of error bars or standard deviations on reported times and sizes, and any statistical comparisons performed. These additions will enable readers to assess result reliability directly.
    revision: yes
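
The additions promised in the third response (number of runs, standard deviations, statistical comparison) can be illustrated with a short summary routine. This is a sketch under stated assumptions: `summarize_runs` and `welch_t` are hypothetical helpers, and the timing lists are made-up placeholders, not measurements from the paper. The coefficient of variation is shown as one possible way to quantify run-to-run stability; the paper does not define stability this way.

```python
import math
import statistics
from typing import Dict, List


def summarize_runs(times_s: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and coefficient of variation of repeated load times."""
    mean = statistics.mean(times_s)
    stdev = statistics.stdev(times_s)  # sample std dev, needs at least 2 runs
    return {"n": len(times_s), "mean": mean, "stdev": stdev, "cv": stdev / mean}


def welch_t(a: List[float], b: List[float]) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)


if __name__ == "__main__":
    # Placeholder timings in seconds, NOT results from the paper.
    system_a = [10.0, 10.5, 9.5, 10.0]
    system_b = [12.0, 12.5, 11.5, 12.0]
    print(summarize_runs(system_a))
    print(f"Welch t = {welch_t(system_a, system_b):.2f}")
```

A large absolute t value relative to the run-to-run spread suggests the gap between two systems is unlikely to be noise. With only a handful of runs, the statistic should be read as indicative rather than conclusive.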

Circularity Check

0 steps flagged

No circularity: claims rest solely on direct experimental measurements with no derivations or self-referential reductions.

full rationale

The paper conducts straightforward benchmarking of Apache Hudi, Iceberg, and Delta Lake via Spark on CSV/JSON inputs up to 7 GB using four sequential ETL steps. It reports measured loading times and final table sizes, then states conclusions about optimality for speed vs. stability/storage. No equations, fitted parameters, predictions, or mathematical derivations appear. No self-citations are invoked as load-bearing premises for uniqueness or ansatzes. The central claims do not reduce to their inputs by construction; they are empirical observations open to external replication or falsification. This matches the default expectation of no significant circularity for an experimental comparison paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on experimental observations rather than derivations; the main assumptions are about the representativeness of the test setup for analytical systems.

axioms (2)
  • domain assumption: The chosen data formats (CSV, JSON) and sizes up to 7 GB represent typical workloads for analytical data systems.
    The study uses these to evaluate efficiency for forming analytical data systems.
  • domain assumption: Apache Spark provides a fair and consistent platform for comparing the three Lakehouse systems.
    All tests use Spark as the processing platform.

Source journal (from the extracted page footer): Information Technology: Computer Science, Software Engineering and Cyber Security, Issue 4, 2025. Article received: 03.11.2025. Article accepted: 10.12.2025. Published: 30.12.2025.