arxiv: 1907.06723 · v1 · pith:3GTIJALRnew · submitted 2019-07-15 · 💻 cs.DC · cs.DB

DOD-ETL: Distributed On-Demand ETL for Near Real-Time Business Intelligence

Pith reviewed 2026-05-24 21:01 UTC · model grok-4.3

classification 💻 cs.DC cs.DB
keywords near real-time ETL · distributed data processing · stream processing · business intelligence · data pipeline · in-memory caching · data partitioning

The pith

DOD-ETL performs near real-time ETL up to 10 times faster than other stream processing frameworks through its on-demand distributed pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DOD-ETL as a solution to the slow ETL bottleneck that prevents timely business intelligence. It combines an on-demand streaming pipeline, distributed parallel processing, in-memory caching, and data partitioning into a technology-independent system. In the authors' comparisons, this setup executes workloads up to 10 times faster than other stream processing frameworks. It was also deployed in a steelworks, where it enabled near real-time reports that were previously unavailable.

Core claim

DOD-ETL addresses the main bottleneck in Business Intelligence solutions, the Extract Transform Load process, by providing it in near real-time. It achieves this by combining an on-demand data stream pipeline with a distributed, parallel and technology-independent architecture with in-memory caching and efficient data partitioning. Comparisons with other Stream Processing frameworks show DOD-ETL executes workloads up to 10 times faster. Deployment in a large steelworks replaced its previous ETL solution and enabled near real-time reports previously unavailable.

What carries the argument

An on-demand data stream pipeline with a distributed parallel architecture, in-memory caching, and efficient data partitioning.
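
As a rough illustration of how key-based partitioning and in-memory caching can work together in a stream pipeline, the sketch below routes each record to a partition by hashing a key and enriches it from a per-worker cache of reference data. It is not the authors' implementation: `partition_for`, `PartitionWorker`, `run`, the `equipment_id` key, and the `lookup` function are all hypothetical names chosen for this example, and the workers run sequentially here rather than in parallel.

```python
import zlib
from typing import Callable, Dict, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionWorker:
    """Handles one partition and keeps reference rows in an in-memory cache."""

    def __init__(self, lookup: Callable[[str], dict]):
        self._lookup = lookup  # hypothetical slow source, e.g. a database query
        self._cache: Dict[str, dict] = {}
        self.lookups = 0  # number of cache misses that hit the slow source

    def enrich(self, record: dict) -> dict:
        key = record["equipment_id"]  # hypothetical partitioning key
        if key not in self._cache:
            self._cache[key] = self._lookup(key)
            self.lookups += 1
        # Merge the cached reference data into the streamed record.
        return {**record, **self._cache[key]}


def run(
    records: List[dict], num_partitions: int, lookup: Callable[[str], dict]
) -> Tuple[List[dict], List[PartitionWorker]]:
    """Route each record to its partition's worker and enrich it."""
    workers = [PartitionWorker(lookup) for _ in range(num_partitions)]
    out = []
    for record in records:
        index = partition_for(record["equipment_id"], num_partitions)
        out.append(workers[index].enrich(record))
    return out, workers
```

Because every record with the same key lands on the same worker, each worker only needs to cache the reference rows for its own keys, which is one plausible reason partitioning and caching are paired in designs of this kind.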

If this is right

  • ETL ceases to be the primary delay in turning data into actionable business information.
  • Existing stream processing tools can be replaced in large operations to support faster reporting.
  • The technology-independent design allows the same pipeline to run across varied computing setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same architecture could be tested in other high-volume data environments such as finance or logistics to check if similar speed gains appear.
  • Further scaling experiments would clarify whether in-memory caching remains effective as data volumes grow beyond the steelworks case.
  • The partitioning method might reduce costs in cloud deployments by lowering the need for constant resource allocation.

Load-bearing premise

The on-demand streaming pipeline with distributed architecture, caching, and partitioning can be realized in production without hidden bottlenecks or correctness issues. The only support offered for this premise is a single steelworks deployment.

What would settle it

A head-to-head performance test on a different large industrial dataset or workload where DOD-ETL does not achieve the reported speedup or encounters data errors would falsify the central performance claim.
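
The "data errors" half of such a test can be made concrete by checking that two pipelines produce the same rows for the same input. The sketch below is one minimal way to do that, assuming rows are hashable tuples and that output order does not matter; `compare_outputs` is a hypothetical helper, not something described in the paper.

```python
from collections import Counter
from typing import Hashable, Iterable


def compare_outputs(
    reference: Iterable[Hashable], candidate: Iterable[Hashable]
) -> dict:
    """Compare two pipeline outputs as multisets of rows, ignoring order."""
    ref, cand = Counter(reference), Counter(candidate)
    missing = ref - cand      # rows the candidate pipeline failed to produce
    unexpected = cand - ref   # rows only the candidate pipeline produced
    return {
        "missing": sum(missing.values()),
        "unexpected": sum(unexpected.values()),
        "match": not missing and not unexpected,
    }
```

Using multisets rather than sets means duplicated or dropped rows are counted, which matters for ETL loads where the same row can legitimately appear more than once.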

Figures

Figures reproduced from arXiv: 1907.06723 by Adriano C. M. Pereira, Gustavo V. Machado, Ítalo Cunha, Leonardo B. Oliveira.

Figure 1: Batch vs. Near real-time ETL. Sabtu et al. [27] enumerate several problems related to near real-time ETL and, along with Ellis [8], they provide some directions and possible solutions to each problem. However, due to these problems' complexity, ETL solutions do not always address them directly: to avoid affecting efficiency on transaction databases, ETL processes were usually run in batches and off-hours (…
Figure 2: DOD-ETL workflow step by step. All steps depend on configuration parameters to work properly. Thus, during DOD-ETL’s deployment, it is imperative to go through a configuration process, where decisions are made to set the following parameters: tables to extract: define which tables will have data extracted from; table nature: from the defined tables, detail which ones are operational (constantly updated) and…
Figure 3: Data splitting working on metals industry context.
Figure 4: In-memory cache initialization overhead.
Figure 5: Scalability: Listener experiment result.
Figure 6: Scalability: Stream Processor experiment result.

Original abstract

The competitive dynamics of the globalized market demand information on the internal and external reality of corporations. Information is a precious asset and is responsible for establishing key advantages to enable companies to maintain their leadership. However, reliable, rich information is no longer the only goal. The time frame to extract information from data determines its usefulness. This work proposes DOD-ETL, a tool that addresses, in an innovative manner, the main bottleneck in Business Intelligence solutions, the Extract Transform Load process (ETL), providing it in near real-time. DOD-ETL achieves this by combining an on-demand data stream pipeline with a distributed, parallel and technology-independent architecture with in-memory caching and efficient data partitioning. We compared DOD-ETL with other Stream Processing frameworks used to perform near real-time ETL and found DOD-ETL executes workloads up to 10 times faster. We have deployed it in a large steelworks as a replacement for its previous ETL solution, enabling near real-time reports previously unavailable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes DOD-ETL, a tool for near real-time ETL in business intelligence. It uses an on-demand data stream pipeline combined with a distributed parallel architecture, in-memory caching, and data partitioning to overcome traditional ETL bottlenecks. The central claim is that DOD-ETL executes workloads up to 10 times faster than other stream processing frameworks, supported by a production deployment in a large steelworks that enabled previously unavailable near real-time reports.

Significance. If the speedup and production claims can be substantiated through controlled experiments, the approach could meaningfully advance practical near real-time BI systems in industrial environments by reducing ETL latency. The architecture elements address a recognized pain point, but the manuscript provides no reproducible evidence that the techniques deliver the attributed gains.

major comments (3)
  1. [Abstract] Abstract: the claim that 'DOD-ETL executes workloads up to 10 times faster' is presented without naming the compared frameworks, describing the workloads, hardware/network configuration, measurement protocol, or any error bars, rendering the central performance result unverifiable and load-bearing for the paper's contribution.
  2. [Abstract] Abstract (steelworks deployment paragraph): the replacement of the previous ETL solution is described only anecdotally with no quantitative before/after metrics, workload characteristics, or implementation details, so the assertion that the architecture enables 'near real-time reports previously unavailable' rests on a single uncontrolled case study.
  3. [Abstract] Abstract: no discussion or evidence is supplied that the on-demand pipeline, in-memory caching, and partitioning avoid hidden bottlenecks or correctness issues in production, which is required to attribute any observed difference to the proposed techniques rather than implementation or data artifacts.
minor comments (1)
  1. [Abstract] The abstract contains several long, general sentences about market dynamics that could be shortened without loss of technical content.
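
To make the first major comment concrete, the sketch below shows one minimal way a speedup claim could be reported with a measurement protocol and error bars: time each workload over repeated runs, then report the mean, the sample standard deviation, and the ratio of means. The helper names `time_runs`, `summarize`, and `speedup` are hypothetical, and this is not the protocol used in the paper.

```python
import statistics
import time
from typing import Callable, List


def time_runs(workload: Callable[[], None], repeats: int = 5) -> List[float]:
    """Run the workload `repeats` times and return wall-clock durations in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples: List[float]) -> dict:
    """Mean and sample standard deviation, the minimum needed for error bars."""
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {"mean": statistics.mean(samples), "stdev": stdev, "n": len(samples)}


def speedup(baseline: List[float], candidate: List[float]) -> float:
    """Ratio of mean baseline time to mean candidate time (>1 means candidate is faster)."""
    return statistics.mean(baseline) / statistics.mean(candidate)
```

A report built this way would still need to state the frameworks, workloads, and hardware alongside the numbers; the code only covers the repetition and dispersion part of the referee's request.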

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed feedback on the abstract. We agree that greater specificity is required for the performance claims and will revise the abstract to improve verifiability. The production deployment description is constrained by confidentiality, limiting quantitative disclosure.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'DOD-ETL executes workloads up to 10 times faster' is presented without naming the compared frameworks, describing the workloads, hardware/network configuration, measurement protocol, or any error bars, rendering the central performance result unverifiable and load-bearing for the paper's contribution.

    Authors: The abstract summarizes results whose details—including compared frameworks (Apache Spark Streaming and Apache Flink), workloads, hardware/network setup, measurement protocol, and error bars—are presented in Section 5. We will revise the abstract to name the frameworks and briefly note the experimental conditions to make the claim more self-contained. revision: yes

  2. Referee: [Abstract] Abstract (steelworks deployment paragraph): the replacement of the previous ETL solution is described only anecdotally with no quantitative before/after metrics, workload characteristics, or implementation details, so the assertion that the architecture enables 'near real-time reports previously unavailable' rests on a single uncontrolled case study.

    Authors: Section 6 provides additional implementation context on the integration. Quantitative before/after metrics cannot be released due to non-disclosure agreements with the partner. We will partially revise the abstract to clarify the qualitative outcome (enabling previously unavailable reports due to latency) while noting the case-study nature. revision: partial

  3. Referee: [Abstract] Abstract: no discussion or evidence is supplied that the on-demand pipeline, in-memory caching, and partitioning avoid hidden bottlenecks or correctness issues in production, which is required to attribute any observed difference to the proposed techniques rather than implementation or data artifacts.

    Authors: Sections 3 and 4 explain the design rationale for the on-demand pipeline, caching, and partitioning to mitigate bottlenecks and maintain correctness. We will add a brief reference in the abstract to these sections to better link observed gains to the techniques. revision: yes

standing simulated objections not resolved
  • Quantitative before/after metrics and workload characteristics from the steelworks deployment, restricted by confidentiality agreements.

Circularity Check

0 steps flagged

No circularity; empirical claims only

Full rationale

The paper contains no equations, derivations, fitted parameters, or first-principles results. Its 10x speedup claim is stated as the outcome of direct empirical comparisons against other frameworks plus one production deployment; these are presented as measurements rather than quantities defined in terms of the paper's own inputs. No self-citation load-bearing steps, ansatzes, or renamings appear. The work is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, mathematical axioms, or newly postulated entities; the contribution is an engineering architecture whose correctness rests on unstated implementation assumptions.



Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages

  1. [1] Apache. Apache Beam. https://beam.apache.org/, 2015

  2. [2] B. Azvine, Z. Cui, D. D. Nauck, and B. Majeed. Real time business intelligence for the adaptive enterprise. In E-Commerce Technology, 2006. The 8th IEEE International Conference on and Enterprise Computing, E-Commerce, and E-Services, The 3rd IEEE International Conference on, pages 29–29. IEEE, 2006

  3. [3] M. A. Bornea, A. Deligiannakis, Y. Kotidis, and V. Vassalos. Semi-streamed index join for near-real time execution of etl transformations. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 159–170. IEEE, 2011

  4. [4] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015

  5. [5] E. F. Codd, S. B. Codd, and C. T. Salley. Providing olap (on-line analytical processing) to user-analysts: An it mandate. Codd and Date, 32, 1993

  6. [6] D. Cutting. Apache Avro. https://avro.apache.org/, 2009

  7. [7] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008

  8. [8] B. Ellis. Real-time analytics: Techniques to analyze and visualize streaming data. John Wiley & Sons, 2014

  9. [9] W. A. Giovinazzo. Object-oriented data warehouse design: building a star schema. Prentice Hall PTR, 2000

  10. [10] Google. Google Dataflow. https://cloud.google.com/dataflow/, 2015

  11. [11] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: Wait-free coordination for internet-scale systems. In USENIX annual technical conference, volume 8, page 9. Boston, MA, USA, 2010

  12. [12] International Society of Automation. Enterprise-Control System Integration Part 2: Object Model Attributes. Isa, 2001

  13. [13] T. Jain, S. Rajasree, and S. Saluja. Refreshing datawarehouse in near real-time. International Journal of Computer Applications, 46(18):24–29, 2012

  14. [14] A. Karakasidis, P. Vassiliadis, and E. Pitoura. Etl queues for active data warehousing. In Proceedings of the 2nd international workshop on Information quality in information systems, pages 28–39. ACM, 2005

  15. [15] J. Kreps, N. Narkhede, J. Rao, et al. Kafka: A distributed messaging system for log processing. In Proceedings of the NetDB, pages 1–7, 2011

  16. [16] Õ. Ljungberg. Measurement of overall equipment effectiveness as a basis for tpm activities. International Journal of Operations & Production Management, 18(5):495–507, 1998

  17. [17] Y. Malhotra. From information management to knowledge management. beyond the 'hi-tech hidebound' systems. Knowledge management and business model innovation, pages 115–134, 2001

  18. [18] M. Mesiti, L. Ferrari, S. Valtolina, G. Licari, G. Galliani, M. Dao, K. Zettsu, et al. Streamloader: an event-driven etl system for the on-line processing of heterogeneous sensor data. In Extending Database Technology, pages 628–631. OpenProceedings, 2016

  19. [19] Microsoft. Azure Stream Analytics. https://azure.microsoft.com/en-us/services/stream-analytics/, 2015

  20. [20] T. Mueller. H2 Database. http://www.h2database.com/, 2012

  21. [21] M. A. Naeem, G. Dobbie, and G. Webber. An event-based near real-time data integration architecture. In Enterprise Distributed Object Computing Conference Workshops, 2008 12th, pages 401–404. IEEE, 2008

  22. [22] M. A. Naeem, G. Dobbie, G. Weber, and S. Alam. R-meshjoin for near-real-time data warehousing. In Proceedings of the ACM 13th international workshop on Data warehousing and OLAP, pages 53–60. ACM, 2010

  23. [23] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, pages 170–177. IEEE, 2010

  24. [24] T. M. Nguyen, J. Schiefer, and A. M. Tjoa. Sense & response service architecture (saresa): an approach towards a real-time business intelligence solution and its use for a fraud detection application. In Proceedings of the 8th ACM international workshop on Data warehousing and OLAP, pages 77–86. ACM, 2005

  25. [25] N. Polyzotis, S. Skiadopoulos, P. Vassiliadis, A. Simitsis, and N.-E. Frantzell. Supporting streaming updates in an active data warehouse. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pages 476–485. IEEE, 2007

  26. [26] N. Polyzotis, S. Skiadopoulos, P. Vassiliadis, A. Simitsis, and N. Frantzell. Meshing streaming updates with persistent data in an active data warehouse. IEEE Transactions on Knowledge and Data Engineering, 20(7):976–991, 2008

  27. [27] A. Sabtu, N. F. M. Azmi, N. N. A. Sjarif, S. A. Ismail, O. M. Yusop, H. Sarkan, and S. Chuprat. The challenges of extract, transform and loading (etl) system implementation for near real-time environment. In Research and Innovation in Information Systems (ICRIIS), 2017 International Conference on, pages 1–5. IEEE, 2017

  28. [28] B. Sahay and J. Ranjan. Real time business intelligence in supply chain analytics. Information Management & Computer Security, 16(1):28–48, 2008

  29. [29] D. Stamatis. The OEE Primer: Understanding Overall Equipment Effectiveness, Reliability, and Maintainability. Productivity Press, 1 pap/cdr edition, 6 2010. ISBN 9781439814062. URL http://amazon.com/o/ASIN/1439814066/

  30. [30] T. Thalhammer, M. Schrefl, and M. Mohania. Active data warehouses: complementing olap with analysis rules. Data & Knowledge Engineering, 39(3):241–269, 2001

  31. [31] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, et al. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 147–156. ACM, 2014

  32. [32] P. Vassiliadis and A. Simitsis. Near real time etl. In New trends in data warehousing and data analysis, pages 1–31. Springer, 2009

  33. [33] F. Waas, R. Wrembel, T. Freudenreich, M. Thiele, C. Koncilia, and P. Furtado. On-demand elt architecture for right-time bi: extending the vision. International Journal of Data Warehousing and Mining (IJDWM), 9(2):21–38, 2013

  34. [34] H. J. Watson and B. H. Wixom. The current state of business intelligence. Computer, 40(9), 2007

  35. [35] A. Wibowo. Problems and available solutions on the stage of extract, transform, and loading in near real-time data warehousing (a literature study). In Intelligent Technology and Its Applications (ISITIA), 2015 International Seminar on, pages 345–350. IEEE, 2015

  36. [36] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. HotCloud, 12:10–10, 2012

  37. [37] F. Zhang, J. Cao, S. U. Khan, K. Li, and K. Hwang. A task-level adaptive mapreduce framework for real-time streaming data in healthcare applications. Future Generation Computer Systems, 43:149–160, 2015