Research on the efficiency of data loading and storage in Data Lakehouse architectures for the formation of analytical data systems
Pith reviewed 2026-05-08 14:02 UTC · model grok-4.3
The pith
Delta Lake loads data fastest while Apache Iceberg uses the least disk space among three common Lakehouse systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments with structured and semi-structured data show Delta Lake completing loads in the shortest time regardless of volume, while Apache Iceberg consistently produces the smallest tables on disk and maintains stable behavior. Apache Hudi records longer load times and larger storage footprints in the same tasks. The study concludes that Delta Lake is the preferred architecture when loading speed is the main requirement, and Apache Iceberg is preferred when disk space savings and stability matter most.
What carries the argument
Four sequential ETL processes that read, transform, and write data into each Lakehouse table format, evaluated by load completion time and resulting table size in the file system.
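The two criteria named above, load completion time and on-disk table size, can be captured by a small harness. The following is a minimal sketch under stated assumptions, not the authors' measurement code: the names `measure_load`, `directory_size_bytes`, and `write_fn` are hypothetical, and `write_fn` stands in for whatever performs the actual write into a Hudi, Iceberg, or Delta Lake table.

```python
import os
import time
from typing import Callable, Tuple


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under `path`, including metadata and logs."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def measure_load(write_fn: Callable[[str], None], table_path: str) -> Tuple[float, int]:
    """Run one load and return (elapsed_seconds, table_size_bytes).

    `write_fn` is a hypothetical callable supplied by the caller. In a Spark
    setup it might wrap something like
    df.write.format("delta").mode("overwrite").save(table_path);
    that wiring is an assumption and is not shown in the paper.
    """
    start = time.perf_counter()
    write_fn(table_path)
    elapsed = time.perf_counter() - start
    return elapsed, directory_size_bytes(table_path)
```

Measuring the whole table directory, rather than only the data files, matters for a comparison like this one because each format keeps its own metadata and log files alongside the data.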
Load-bearing premise
Performance measured on files up to 7 GB and four fixed ETL steps is enough to decide which architecture is optimal for analytical data systems in general.
What would settle it
Repeating the loads on a dataset larger than 7 GB or on a production-scale Spark cluster where Iceberg finishes faster than Delta Lake would show the speed ranking does not hold.
Original abstract
The paper presents a study of the efficiency of loading and storing data in the three most common Data Lakehouse systems, including Apache Hudi, Apache Iceberg, and Delta Lake, using Apache Spark as a distributed data processing platform. The study analyzes the behavior of each system when processing structured (CSV) and semi-structured (JSON) data of different sizes, including loading files up to 7 GB in size. The purpose of the work is to determine the most optimal Data Lakehouse architecture based on the type and volume of data sources, data loading performance using Apache Spark, and disk size of data for forming analytical data systems. The research covers the development of four sequential ETL processes, which include reading, transforming, and loading data into tables in each of the Data Lakehouse systems. The efficiency of each Lakehouse was evaluated according to two key criteria: data loading time and the volume of tables formed in the file system. For the first time, a comparison of performance and data storage in Apache Iceberg, Apache Hudi, and Delta Lake Data Lakehouse systems was conducted to select the most relevant architecture for building analytical data systems. The practical value of the study consists in the fact that it assists data engineers and architects in choosing the most appropriate Lakehouse architecture, understanding the balance between loading performance and storage efficiency. Experimental results showed that Delta Lake is the most optimal architecture for systems where the priority is the speed of loading data of any volume, while Apache Iceberg is most appropriate for systems where stability and disk space savings are critical. Apache Hudi proved ineffective in data loading and storage evaluation tasks but could potentially be effective in incremental update and streaming processing scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an experimental study comparing the data loading performance and storage efficiency of Apache Hudi, Apache Iceberg, and Delta Lake using Apache Spark for CSV and JSON datasets up to 7 GB. Through four sequential ETL processes, it measures loading times and final table sizes, concluding that Delta Lake offers the best loading speed for data of any volume, Iceberg provides superior stability and disk space savings, and Hudi is less effective for these tasks but may suit incremental updates.
Significance. If the reported performance differences hold under broader conditions, the work could assist practitioners in selecting appropriate Data Lakehouse architectures for analytical systems by balancing loading speed against storage efficiency. The empirical data on small-scale workloads adds to the limited body of comparative studies in this area.
major comments (3)
- [Abstract] The claim in the abstract that Delta Lake is the most optimal architecture for the speed of loading data of any volume is not supported by the experiments, which are limited to data sizes up to 7 GB with only four sequential ETL processes and no larger-scale runs, concurrent writers, update/merge workloads, or failure-injection tests.
- [Results] Stability is asserted as a key advantage for Apache Iceberg without an operational definition or any quantitative measurement; the evaluation criteria are restricted to loading time and final table volume, leaving the stability claim unsupported.
- [Experimental Setup] No details are provided on hardware specifications, number of runs, error bars, or statistical significance, which undermines assessment of whether the measured differences reliably support the optimality rankings.
minor comments (1)
- [Abstract and Introduction] The abstract and introduction repeat the study purpose and evaluation criteria multiple times; condensing this would improve readability.
Simulated Author's Rebuttal
Thank you for the constructive referee report. We address each major comment below, indicating planned revisions to align claims with experimental scope and strengthen the manuscript.
- Referee: [Abstract] The claim in the abstract that Delta Lake is the most optimal architecture for the speed of loading data of any volume is not supported by the experiments, which are limited to data sizes up to 7 GB with only four sequential ETL processes and no larger-scale runs, concurrent writers, update/merge workloads, or failure-injection tests.
  Authors: We agree that the experiments cover only datasets up to 7 GB using four sequential ETL processes and do not include larger scales, concurrent writers, updates, merges, or failure tests. The phrasing 'data of any volume' extrapolates beyond the tested conditions. In revision, we will update the abstract to state that Delta Lake showed the fastest loading for the evaluated volumes up to 7 GB, and we will add an explicit limitations paragraph in the discussion section noting the restricted scope and the value of future larger-scale validation.
  Revision: yes
- Referee: [Results] Stability is asserted as a key advantage for Apache Iceberg without an operational definition or any quantitative measurement; the evaluation criteria are restricted to loading time and final table volume, leaving the stability claim unsupported.
  Authors: The observation is accurate: our evaluation criteria were limited to loading time and final table size, with no operational definition or quantitative metrics for stability. References to 'superior stability' for Iceberg were informal and not data-driven. We will remove all claims about stability advantages from the abstract, results, and conclusions, restricting statements to the two measured criteria only.
  Revision: yes
- Referee: [Experimental Setup] No details are provided on hardware specifications, number of runs, error bars, or statistical significance, which undermines assessment of whether the measured differences reliably support the optimality rankings.
  Authors: We will revise the Experimental Setup section to specify the hardware configuration (Spark cluster CPU, memory, and storage details), the number of runs executed per configuration, inclusion of error bars or standard deviations on reported times and sizes, and any statistical comparisons performed. These additions will enable readers to assess result reliability directly.
  Revision: yes
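The reporting requested in the exchange above (number of runs, error bars, a statistical comparison) can be sketched with the Python standard library. This is an illustrative sketch, not the authors' analysis: the function names are hypothetical, the coefficient of variation is only one possible operational definition of "stability" and is an assumption here, and Welch's t-statistic is one common choice for comparing two sets of timings with unequal variances.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize_runs(times: Sequence[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and coefficient of variation of repeated runs."""
    mean = statistics.mean(times)
    stdev = statistics.stdev(times)  # sample stdev; requires at least two runs
    return {"n": len(times), "mean": mean, "stdev": stdev, "cv": stdev / mean}


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t-statistic for two independent samples with unequal variances."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Made-up load times in seconds, for illustration only (not data from the paper).
format_a = [10.0, 12.0, 14.0]
format_b = [20.0, 22.0, 24.0]
summary_a = summarize_runs(format_a)
t_stat = welch_t(format_a, format_b)
```

A large absolute t-statistic suggests the difference in mean load time is unlikely to be run-to-run noise. Turning it into a p-value also requires the Welch-Satterthwaite degrees of freedom, which this sketch omits.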
Circularity Check
No circularity: claims rest solely on direct experimental measurements with no derivations or self-referential reductions.
The paper conducts straightforward benchmarking of Apache Hudi, Iceberg, and Delta Lake via Spark on CSV/JSON inputs up to 7 GB using four sequential ETL steps. It reports measured loading times and final table sizes, then states conclusions about optimality for speed vs. stability/storage. No equations, fitted parameters, predictions, or mathematical derivations appear. No self-citations are invoked as load-bearing premises for uniqueness or ansatzes. The central claims do not reduce to their inputs by construction; they are empirical observations open to external replication or falsification. This matches the default expectation of no significant circularity for an experimental comparison paper.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The chosen data formats (CSV, JSON) and sizes up to 7 GB represent typical workloads for analytical data systems.
- Domain assumption: Apache Spark provides a fair and consistent platform for comparing the three Lakehouse systems.
Source: Information Technology: Computer Science, Software Engineering and Cyber Security, Issue 4, 2025. Article received: 03.11.2025. Accepted: 10.12.2025. Published: 30.12.2025.