Koalja: from Data Plumbing to Smart Workspaces in the Extended Cloud

Ewout Prangsma; Mark Burgess

arxiv: 1907.01796 · v1 · pith:ULWO5VFHnew · submitted 2019-07-03 · 💻 cs.DC · cs.MA · cs.NI

Pith reviewed 2026-05-25 10:12 UTC · model grok-4.3

classification 💻 cs.DC cs.MA cs.NI

keywords data pipelines, Kubernetes, serverless, data provenance, edge computing, sustainability, cloud computing, metadata tracking

The pith

Koalja builds a data pipeline platform on Kubernetes that hides infrastructure details while tracking provenance and optimizing energy use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Koalja as a platform for wiring data pipelines on top of Kubernetes that keeps the underlying system invisible to users. It captures detailed metadata during data flow to enable complete tracing of origins and software versions. This design supports starting with simple development setups and moving them to production systems using little infrastructure expertise. Optimizations reduce unnecessary data movement and processing to address growing sustainability needs in cloud and edge settings.

Core claim

Koalja describes a generalized data wiring or pipeline platform, built on top of Kubernetes, for plugin user code. Koalja makes the Kubernetes underlay transparent to users for a serverless experience, and offers a breadboarding experience for development of data sharing circuitry, to commoditize its gradual promotion to a production system, with a minimum of infrastructure knowledge. Enterprise grade metadata are captured as data payloads flow through the circuitry, allowing full tracing of provenance and forensic reconstruction of transactional processes, down to the versions of software that led to each outcome. Koalja attends to optimizations for avoiding unwanted processing and t

What carries the argument

Koalja's generalized data wiring platform that renders the Kubernetes underlay transparent while capturing enterprise metadata for provenance and applying flow optimizations.

Load-bearing premise

Hiding Kubernetes details while adding provenance tracking and energy optimizations will let users with minimal infrastructure knowledge move breadboarded data systems into production.

What would settle it

A team with no prior Kubernetes experience attempts to build and deploy a multi-stage pipeline solely through Koalja and then checks whether complete provenance records down to software versions can be retrieved for every output.
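The check described above can be made concrete with a small sketch. It is an illustration only, not Koalja's API: the `Stamp`, `Payload`, `run_stage`, and `provenance_complete` names, the three stage names, and the version strings are all hypothetical. Each stage appends a stamp recording its name, its software version, and a digest of its input, and the final check asks whether every expected stage left such a stamp.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Stamp:
    """One provenance entry: which stage ran, at which version, on what input."""
    stage: str
    version: str
    input_digest: str


@dataclass
class Payload:
    """A data value plus the stamps it has collected along the way."""
    value: bytes
    stamps: List[Stamp] = field(default_factory=list)


def digest(data: bytes) -> str:
    """Content digest used to identify the input of a stage."""
    return hashlib.sha256(data).hexdigest()


def run_stage(name: str, version: str,
              fn: Callable[[bytes], bytes], p: Payload) -> Payload:
    """Apply fn to the payload and append a stamp describing this step."""
    stamp = Stamp(stage=name, version=version, input_digest=digest(p.value))
    return Payload(value=fn(p.value), stamps=p.stamps + [stamp])


def provenance_complete(p: Payload, expected_stages: List[str]) -> bool:
    """True if every expected stage left a stamp, in order, with a version."""
    seen = [s.stage for s in p.stamps]
    return seen == expected_stages and all(s.version for s in p.stamps)


# Hypothetical three-stage pipeline: (name, version, transformation).
STAGES: List[Tuple[str, str, Callable[[bytes], bytes]]] = [
    ("ingest", "1.0.0", lambda b: b.strip()),
    ("normalize", "2.1.3", lambda b: b.lower()),
    ("publish", "0.4.0", lambda b: b + b"\n"),
]


def run_pipeline(raw: bytes) -> Payload:
    """Push one raw value through all stages, collecting stamps."""
    p = Payload(value=raw)
    for name, version, fn in STAGES:
        p = run_stage(name, version, fn, p)
    return p
```

Running `run_pipeline` on a sample input and passing the result to `provenance_complete` with the expected stage names is the smallest version of the experiment: an output with a missing or unversioned stamp fails the check.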

Figures

Figures reproduced from arXiv: 1907.01796 by Ewout Prangsma, Mark Burgess.

**Figure 1.** A skeletal classification of data pipe processing.

**Figure 2.** 3 Views. Travelling passport documents, versus logs of entry and exit from a checkpoint, versus the map of checkpoints and routes.

**Figure 3.** A pipeline is formed from ‘data wiring’ that forms sequences passing data between transformations supplied by user code. These transformations may rely on exterior services, and they certainly rely on external storage.

**Figure 4.** Architectural elements of Koalja, showing the three main kinds of agent: tasks, links, and the pipeline registry.

**Figure 5.** A basic input language for describing process

**Figure 6.** An example pair of pipelines, represented as a single data circuit. The upper pipeline shows a training process for a TensorFlow neural network, which is deployed as a service consulted by the lower pipeline. The lower pipeline receives sample images to be recognized and classified according to the machine learning model trained by the upper pipeline. The implicit link between the two pipelines is shown by…

**Figure 7.** Aggregation of data from multiple input channels, e.g. collecting data from multiple weather sensors for a complete sample set. Some sensors (e.g. wind speed) may take longer to arrive than others (e.g. temperature). Should the pipeline wait for all the data, or for several repeated measurements (as a time series of values or as a sliding window)? There are several common possibilities for coordinating and comp…

**Figure 8.** As data are shifted around processes in possibly parallel pipelines, formed from elastically scaled agents running software, which is changing in real time, we need to be able to see the causal travel documents of the data to know exactly what led to outcomes.

**Figure 9.** A local checkpoint log, with interleaving and branching timelines, as discussed in [16].

**Figure 10.** An invariant concept map of an instrumented data pipeline that may include user code, as described in [16].

**Figure 11.** The virtual access boundary for a data wiring may be based on many criteria, as a matter of policy. Sometimes we may want the same process to span different geographical regions, some even in motion with respect to others. Other times, we may want to separate data by function in the same room.

**Figure 12.** The virtual access boundary for a data wiring may be based on many criteria, as a matter of policy. Sometimes we may want the same process to span different geographical regions. Other times we may want to separate data by function in the same room.

Original abstract

Koalja describes a generalized data wiring or `pipeline' platform, built on top of Kubernetes, for plugin user code. Koalja makes the Kubernetes underlay transparent to users (for a `serverless' experience), and offers a breadboarding experience for development of data sharing circuitry, to commoditize its gradual promotion to a production system, with a minimum of infrastructure knowledge. Enterprise grade metadata are captured as data payloads flow through the circuitry, allowing full tracing of provenance and forensic reconstruction of transactional processes, down to the versions of software that led to each outcome. Koalja attends to optimizations for avoiding unwanted processing and transportation of data, that are rapidly becoming sustainability imperatives. Thus one can minimize energy expenditure and waste, and design with scaling in mind, especially with regard to edge computing, to accommodate an Internet of Things, Network Function Virtualization, and more.
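One common way to avoid unwanted processing is to skip a stage when the same software version has already seen the same input, much as a build tool skips up-to-date targets. The sketch below illustrates that idea under stated assumptions; the abstract does not say that Koalja works this way, and the `CachedStage` class and its fields are hypothetical names introduced here.

```python
import hashlib
from typing import Callable, Dict, Tuple


class CachedStage:
    """Wrap a transformation so repeated inputs are served from a cache.

    The cache key combines the stage's software version with a digest of
    the input, so a new version of the code recomputes old inputs.
    """

    def __init__(self, name: str, version: str,
                 fn: Callable[[bytes], bytes]) -> None:
        self.name = name
        self.version = version
        self.fn = fn
        self.runs = 0  # counts real executions, for inspection
        self._cache: Dict[Tuple[str, str], bytes] = {}

    def __call__(self, data: bytes) -> bytes:
        key = (self.version, hashlib.sha256(data).hexdigest())
        if key not in self._cache:
            # Cache miss: do the work once and remember the result.
            self.runs += 1
            self._cache[key] = self.fn(data)
        return self._cache[key]


# Hypothetical stage that upper-cases its input.
upper = CachedStage("upper", "1.0.0", bytes.upper)
first = upper(b"sensor reading")
second = upper(b"sensor reading")  # identical input, served from the cache
```

After the two calls, `upper.runs` is 1, which is the saving in question: the second request cost a hash and a lookup rather than a recomputation. Whether that trade pays off depends on how expensive the stage is relative to hashing its input.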

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Koalja describes a Kubernetes data pipeline system aimed at transparency, provenance capture, and sustainability optimizations, but supplies no evidence, comparisons, or technical details to support any of it.

Koalja is pitched as a generalized data wiring platform on Kubernetes that hides the underlay for a serverless experience, lets users breadboard data flows that can later move to production, captures enterprise metadata for full provenance tracing down to software versions, and adds optimizations to cut unnecessary processing and data movement for sustainability reasons, especially in edge and IoT settings. The motivation around traceability and energy use in distributed flows is reasonable and timely. The paper does a fair job of naming those practical concerns and stating the intended capabilities in plain terms. That is the extent of what it does well.

The central weakness is that none of the claims are backed by anything. There is no architecture description, no implementation sketch, no evaluation, no benchmark, and no comparison to existing workflow or provenance systems. The text stays at the level of goals and features without showing how any of them are realized or why the specific combination would be an advance. Similar ideas appear in prior work on distributed pipelines and metadata tracking, so the novelty cannot be judged from what is here. This leaves the reader with a list of desired properties but no way to assess feasibility or impact.

The paper is aimed at practitioners who build cloud data infrastructure and might be looking for feature ideas. It offers little to someone seeking a validated approach or a new technical result. I would not bring it to a reading group. I would not cite it. It does not merit sending to peer review in this form because there is no substance for referees to evaluate.

Referee Report

3 major / 1 minor

Summary. The manuscript describes Koalja, a generalized data wiring or pipeline platform built on top of Kubernetes for plugin user code. It claims to make the Kubernetes underlay transparent to users for a serverless experience, offer a breadboarding experience for development of data sharing circuitry that can be gradually promoted to production systems with minimal infrastructure knowledge, capture enterprise-grade metadata as data payloads flow through the circuitry to enable full tracing of provenance and forensic reconstruction down to software versions, and attend to optimizations for avoiding unwanted processing and transportation of data to minimize energy expenditure and support scaling in edge computing, IoT, and NFV scenarios.

Significance. If the described capabilities for infrastructure abstraction, provenance capture, and data-movement optimizations are realized and validated, the work could contribute to simplifying access to complex distributed systems while addressing provenance requirements and sustainability concerns in extended cloud environments. The integration of these elements in a single platform targeting gradual development-to-production transitions is potentially relevant to practitioners in data pipelines and edge computing.

major comments (3)

[Abstract] The central claim that Koalja 'makes the Kubernetes underlay transparent to users (for a `serverless' experience)' is presented without any architectural description, mechanism, or comparison to existing Kubernetes abstractions such as those in Kubeflow; this is load-bearing for the primary user-facing benefit asserted in the paper.
[Abstract] The claim that 'enterprise grade metadata are captured as data payloads flow through the circuitry, allowing full tracing of provenance and forensic reconstruction of transactional processes, down to the versions of software' is load-bearing for the forensic capability but supplies no metadata schema, capture points, or storage approach to substantiate how reconstruction is achieved.
[Abstract] The statement that Koalja 'attends to optimizations for avoiding unwanted processing and transportation of data' that 'minimize energy expenditure and waste' is load-bearing for the sustainability contribution but provides no specific techniques, decision points, or quantitative results to support the claimed reductions.

minor comments (1)

The title uses the informal phrase 'Data Plumbing'; a more precise phrasing would improve academic tone while retaining the core message.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that the abstract's high-level claims would benefit from brief pointers to the supporting mechanisms and results described in the body of the manuscript. We address each point below and will revise the abstract accordingly in the next version.

Referee: [Abstract] The central claim that Koalja 'makes the Kubernetes underlay transparent to users (for a `serverless' experience)' is presented without any architectural description, mechanism, or comparison to existing Kubernetes abstractions such as those in Kubeflow; this is load-bearing for the primary user-facing benefit asserted in the paper.

Authors: Section 3 of the manuscript details the architecture, including the use of Kubernetes custom resources, operators, and pod abstractions that hide the underlay to deliver the serverless experience. Section 2 provides a comparison to Kubeflow and related systems. We will revise the abstract to include a concise reference to these architectural elements and the comparison.

revision: yes

Referee: [Abstract] The claim that 'enterprise grade metadata are captured as data payloads flow through the circuitry, allowing full tracing of provenance and forensic reconstruction of transactional processes, down to the versions of software' is load-bearing for the forensic capability but supplies no metadata schema, capture points, or storage approach to substantiate how reconstruction is achieved.

Authors: Section 4 describes the metadata schema (including payload, processing, and version metadata), the capture points at each wiring stage, and the storage approach using a queryable provenance store that supports forensic reconstruction. We will revise the abstract to briefly indicate these mechanisms.

revision: yes

Referee: [Abstract] The statement that Koalja 'attends to optimizations for avoiding unwanted processing and transportation of data' that 'minimize energy expenditure and waste' is load-bearing for the sustainability contribution but provides no specific techniques, decision points, or quantitative results to support the claimed reductions.

Authors: Section 5 specifies the optimization techniques (data-locality scheduling, lazy materialization, and edge-aware routing) and decision points; Section 6 reports quantitative energy and data-movement reductions from the evaluation. We will revise the abstract to reference these techniques and results.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The manuscript is a descriptive system paper presenting the Koalja platform architecture for data pipelines on Kubernetes. It contains no equations, derivations, predictions of fitted quantities, or first-principles results that could reduce to their own inputs. Claims about transparency, provenance capture, and energy optimizations are stated as design goals and capabilities rather than derived outputs. No self-citations, uniqueness theorems, or ansatzes appear in a load-bearing role. The derivation chain is therefore empty and self-contained by virtue of being non-mathematical.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

No mathematical content; the ledger is empty because the paper is a system description rather than a derivation or empirical study.

invented entities (1)

Koalja platform · no independent evidence
purpose: Generalized data wiring and pipeline system on Kubernetes with provenance and optimization features
The described system is the main contribution; no independent evidence or falsifiable predictions are supplied in the abstract.

pith-pipeline@v0.9.0 · 5680 in / 1041 out tokens · 23583 ms · 2026-05-25T10:12:30.028486+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 3 internal anchors

[1] R. David and H. Alla. Petri nets for modelling of dynamic systems: a survey. Automatica, 30:175–202, 1994.

[2] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R.J. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 8:1792–1803, 2015.

[3] D. Wampler. Fast Data Architectures for Streaming Applications. O’Reilly, 2016.

[4] Kai Li, Jeffrey Naughton, and James Plank. Real-time concurrent checkpoint for parallel programs. In Proceedings of the second ACM SIGPLAN Symposium on principles and practice of parallel programming, pages 79–88. Association for Computing Machinery, 1990.

[5] M. Liu, M. Li, D. Golovnya, E. A. Rundensteiner, and K. Claypool. Sequence pattern query processing over out-of-order event streams. In 2009 IEEE 25th International Conference on Data Engineering, pages 784–795, March 2009.

[6] H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In Proceedings of the ACM Symposium on Cloud Computing, SOCC ’14, pages 6:1–6:15, New York, NY, USA, 2014. ACM.

[7] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 423–438, New York, NY, USA, 2013. ACM.

[8] Petar Maymounkov. Koji: Automating pipelines with mixed-semantics data sources. CoRR, abs/1901.01908, 2019.

[9] P. Borrill, M. Burgess, M. Dvorkin, and H. Wildfeuer. Workspaces. Technical report, 2015.

[10] M. Burgess. Spacetimes with semantics (iii). http://arxiv.org/abs/1608.02193, 2016.

[11] M. Burgess. A spacetime approach to generalized cognitive reasoning in multi-scale learning. https://arxiv.org/abs/1702.04638, 2017.

[12] M. Burgess. A site configuration engine. Computing systems (MIT Press: Cambridge MA), 8:309, 1995.

[13] J.A. Bergstra and M. Burgess. Promise Theory: Principles and Applications. χtAxis Press, 2014.

[14] MinIO Object Storage Project.

[15] O. Hartig and J. Pérez. Semantics and complexity of graphql. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, pages 1155–1164, Republic and Canton of Geneva, Switzerland, 2018. International World Wide Web Conferences Steering Committee.

[16] M. Burgess. Observability in distributed systems. Unpublished, 2019.

[17] M. Burgess and H. Wildfeuer. Federated multi-tenant service architecture for an internet of things. https://tools.ietf.org/html/draft-burgess-promise-iot-arch-00, October 2015.

[18] M. Burgess. Cellibrium project. https://github.com/markburgess/Cellibrium, 2015.
