Elastic Scheduling of Intermittent Query Processing in a Cluster Environment

Saranya Chandrasekaran; S. Sudarshan

arxiv: 2605.08601 · v2 · pith:2A6XIJWYnew · submitted 2026-05-09 · 💻 cs.DB

Elastic Scheduling of Intermittent Query Processing in a Cluster Environment

Saranya Chandrasekaran , S. Sudarshan This is my paper

Pith reviewed 2026-05-20 23:31 UTC · model grok-4.3

classification 💻 cs.DB

keywords elastic schedulingintermittent query processingdeadline constraintscost minimizationcluster scalingstream processingApache Spark

0 comments

The pith

Elastic scheduling for batched queries in clusters meets deadlines at minimum cost by scaling nodes dynamically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops scheduling schemes for processing data streams in batches over fixed windows rather than processing tuples continuously. By scaling the number of nodes up or down in an elastic cluster, these schemes complete each batch by its deadline while keeping total resource usage low. The methods also manage several queries at once, accept new queries that arrive mid-stream, and respond to changes in data arrival rates. Tests on Spark running in a cloud service with TPC-H and Yahoo data show the elastic batch approach uses less cost than fixed node counts or standard streaming.

Core claim

The central claim is that intermittent batched processing of tuples in an elastic parallel environment can be scheduled by scaling nodes to meet deadlines after each window while minimizing overall cost, with the schemes extending to concurrent queries, new query arrivals, and input rate variations.

What carries the argument

Scheduling schemes that allocate nodes for each batch to minimize cost subject to deadline constraints in an elastic setting.

If this is right

Deadlines are met for window-based queries even when multiple queries run concurrently.
New queries can arrive and be incorporated without restarting the schedule.
Input rate changes are absorbed by adjusting node counts between batches.
Evaluations show lower cost than fixed clusters or eager Spark streaming on standard benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach suggests cloud stream analytics can run at lower hourly expense by avoiding constant over-provisioning.
The same batch-and-scale logic could transfer to other distributed engines that support dynamic resource allocation.
Better runtime prediction of batch times would make the cost savings more reliable across varying workloads.

Load-bearing premise

That processing times, scaling costs, and deadline constraints can be modeled accurately enough to allow optimization of node allocation for minimum cost without violating deadlines in real elastic environments.

What would settle it

An experiment in which the elastic scheduler misses a deadline or uses more total cost than a fixed-node baseline under realistic rate fluctuations would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.08601 by Saranya Chandrasekaran, S. Sudarshan.

**Figure 1.** Figure 1: Components of Custom Scheduler Also, for multiple streams, we assume that all the matching tuples from the input streams arrive together. During simulation for assessing the variations in the input rate that can be supported with the already generated schedule, we consider the same amount of variation in each of the input streams. For example, for assessing the maximum rate, we consider both orders and lin… view at source ↗

**Figure 2.** Figure 2: TPC-Q1 processing duration for different configu [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: TPC-Q1 processing duration (for 4500 files) for dif [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Variable Input Rate Profiles and Node Requirements [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: Release of nodes experiment 9.7 Experiments with Partial Aggregation We carried out experiments with partial aggregation enabled for 0.4D and 0.3D cases with baseline input rate. The frequency of doing partial aggregation was set as 25% i.e. to carry out partial aggregation once after every 1/4th of the total number of batches are processed [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 5.** Figure 5: Release of nodes experiment [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

read the original abstract

Many applications process a stream of tuples over a window duration, and require the results within a specified deadline after the end of the window. For such scenarios, processing tuples intermittently (in batches) instead of eagerly processing tuples as they arrive significantly reduces the overall cost. Earlier work on intermittent query processing has addressed only fixed environments. In this paper, we propose scheduling schemes for batched processing of tuples, in an elastic parallel environment, scaling nodes up or down. Our scheduling schemes ensure to meet the deadlines, while incurring minimum cost. Our schemes also handle multiple concurrent queries, the arrival of new queries, and input rate variations. We have implemented our schemes on top of Apache Spark, in the AWS EMR environment, and evaluated performance with both TPC-H and Yahoo Streaming datasets. Our experimental results show that our scheduling algorithms significantly outperform alternatives, such as using a fixed set of nodes without elasticity, or using Spark streaming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Extends intermittent query processing to elastic clusters with Spark implementation and dynamic handling, but rests on untested accuracy of processing time and scaling cost models.

read the letter

The main takeaway is that this paper moves intermittent query processing from fixed environments to elastic clusters, with scheduling schemes that scale nodes up or down to meet deadlines at the lowest cost while dealing with concurrent queries, new arrivals, and rate variations. They implemented this on Apache Spark in AWS EMR and tested it on TPC-H and Yahoo Streaming datasets. The results claim clear wins over fixed-node configurations and plain Spark streaming. That practical implementation and the handling of dynamic elements are the parts that stand out as useful. The soft spot is the dependence on pre-computed processing times and scaling costs being accurate. Real elastic environments bring node startup delays, shuffle overheads, and unpredictable rate shifts that can break those models, especially with shared resources across queries. The work would be more convincing with sensitivity checks or adaptive re-estimation, but those seem missing from what is described. This is aimed at database and big data folks working on cost-efficient stream processing with deadlines. Readers who need applied scheduling ideas for cloud clusters will get something out of the experiments and comparisons. It has enough of a real-system evaluation to merit a serious referee, even if the modeling robustness needs attention. I would send it for peer review with a request for more validation on the assumption that the models stay reliable under elasticity.

Referee Report

2 major / 1 minor

Summary. The paper proposes scheduling schemes for intermittent batched query processing over windows in an elastic parallel cluster (Apache Spark on AWS EMR). The schemes dynamically scale nodes to meet deadlines at minimum cost, while also handling concurrent queries, new query arrivals, and input-rate variations. Experiments on TPC-H and Yahoo Streaming datasets report significant outperformance versus fixed-node allocations and standard Spark streaming.

Significance. If the modeling assumptions hold, the work addresses a practical gap by extending intermittent processing from fixed to elastic environments with concurrency support. The real-system implementation on Spark/AWS EMR and evaluation on standard benchmarks constitute concrete strengths that could inform cost-aware stream processing in cloud settings.

major comments (2)

[§4] §4 (Scheduling Formulation): The deadline-meeting and minimum-cost claims rest on the assumption that per-tuple processing times, scaling latencies, and costs can be accurately pre-computed and remain stable; the manuscript provides no sensitivity analysis or error bounds showing robustness when these estimates deviate under real node-acquisition overheads and rate changes.
[Experimental Evaluation section] Experimental Evaluation section, results tables: Reported outperformance lacks error bars, number of runs, or explicit methodology for measuring deadline violations and cost under concurrent queries and rate variations, leaving the central empirical claim difficult to verify against modeling inaccuracies.

minor comments (1)

[§3] Notation for the cost and deadline constraints could be made more explicit with a single summary table of symbols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the modeling assumptions and experimental reporting. We address each major comment below and describe the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [§4] §4 (Scheduling Formulation): The deadline-meeting and minimum-cost claims rest on the assumption that per-tuple processing times, scaling latencies, and costs can be accurately pre-computed and remain stable; the manuscript provides no sensitivity analysis or error bounds showing robustness when these estimates deviate under real node-acquisition overheads and rate changes.

Authors: We acknowledge the validity of this observation. The current formulation relies on profiled estimates for processing times and latencies. In the revised manuscript we will add a sensitivity analysis subsection to §4. This analysis will perturb per-tuple processing times and scaling latencies by ±25% and report the resulting changes in deadline compliance rates and total cost, thereby providing explicit error bounds under realistic estimation inaccuracies. revision: yes
Referee: Experimental Evaluation section, results tables: Reported outperformance lacks error bars, number of runs, or explicit methodology for measuring deadline violations and cost under concurrent queries and rate variations, leaving the central empirical claim difficult to verify against modeling inaccuracies.

Authors: We agree that additional statistical detail is needed. We will revise the Experimental Evaluation section to report results averaged over 10 independent runs, including error bars for standard deviation. We will also add explicit methodology text describing how deadline violations are counted (fraction of windows whose completion exceeds the deadline) and how cost is aggregated under concurrent queries and input-rate changes. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical validation of scheduling schemes via implementation and external datasets

full rationale

The paper describes scheduling schemes for intermittent batched query processing in elastic clusters, with claims of deadline compliance at minimum cost supported by direct implementation on Apache Spark in AWS EMR and performance evaluation against TPC-H and Yahoo Streaming datasets. No derivation chain, equations, or first-principles results are presented that reduce by construction to fitted inputs, self-definitions, or self-citations; the outperformance claims rest on experimental comparisons to fixed-node and Spark streaming baselines rather than any internal fitting or renaming of known results. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described. The approach implicitly relies on standard assumptions about cloud resource costs and processing predictability, but these cannot be audited without the full text.

pith-pipeline@v0.9.0 · 5683 in / 1093 out tokens · 36873 ms · 2026-05-20T23:31:19.903208+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose scheduling schemes for batched processing of tuples, in an elastic parallel environment, scaling nodes up or down. Our scheduling schemes ensure to meet the deadlines, while incurring minimum cost.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The cost model in terms of cost per tuple and overhead cost is given in equation 2 TP = ((1−P) + P/Np) ∗ Nt ∗ CPT + ON + OX

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1]

Benjamin Berg, Mor Harchol-Balter, Benjamin Moseley, Weina Wang, and Justin Whitehouse. 2020. Optimal resource allocation for elastic and inelastic jobs. InProceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures. 75–87

work page 2020
[2]

Saranya Chandrasekaran and S Sudarshan. 2024. Scheduling of intermittent query processing. InInternational Database Engineered Applications Symposium. Springer, 363–376

work page 2024
[3]

Dazhao Cheng, Yuan Chen, Xiaobo Zhou, Daniel Gmach, and Dejan Milojicic

work page
[4]

InIEEE INFOCOM

Adaptive scheduling of parallel jobs in Spark Streaming. InIEEE INFOCOM. 1–9

work page
[5]

Stratos Dimopoulos, Chandra Krintz, and Rich Wolski. 2017. Justice: A Deadline- Aware, Fair-Share Resource Allocator for Implementing Multi-Analytics. InIEEE CLUSTER. 233–244

work page 2017
[6]

Xiaofei Hou, TK Ashwin Kumar, Johnson P Thomas, and Hong Liu. 2017. Dy- namic deadline-constraint scheduler for Hadoop YARN. In2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDC...

work page 2017
[7]

Botong Huang, Lianggui Weng, Wei Chen, Kai Zeng, Yihui Feng, Bolin Ding, Jin- gren Zhou, Zuozhi Wang, and Chen Li. 2025. Agamotto: Scheduling of Deadline- Oriented Incremental Query Execution under Uncertain Resource Price.Pro- ceedings of the VLDB Endowment18, 6 (2025), 1852–1864

work page 2025
[8]

Islam, S

M. Islam, S. Karunasekera, and R. Buyya. 2017. dSpark: Deadline-Based Resource Allocation for Big Data Applications in Apache Spark. InIEEE International Conference on e-Science (e-Science). 89–98

work page 2017
[9]

Zhengyu Ou, Ge Yu, Yaxin Yu, Shanshan Wu, Xiaochun Yang, and Qingxu Deng

work page
[10]

InAdvances in Web-Age Inf

Tick Scheduling: A Deadline Based Optimal Task Scheduling Approach for Real-Time Data Stream Systems. InAdvances in Web-Age Inf. Management W AIM, Vol. 3739. 725–730

work page
[11]

Zechao Shang et al. 2020. CrocodileDB: Efficient Database Execution through Intelligent Deferment. InConf. on Innovative Data Systems Research, CIDR

work page 2020
[12]

Subhajit Sidhanta, Wojciech Golab, and Supratik Mukhopadhyay. 2016. OptEx: A deadline-aware cost optimization model for spark. In2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). IEEE, 193–202

work page 2016
[13]

Daniel Sotolongo, Daniel Mills, Tyler Akidau, Anirudh Santhiar, Attila-Péter Tóth, Botong Huang, Boyuan Zhang, Igor Belianski, Ling Geng, Matt Uhlar, et al. 2025. Streaming Democratized: Ease Across the Latency Spectrum with Delayed View Semantics and Snowflake Dynamic Tables. InCompanion of the 2025 International Conference on Management of Data. 622–634

work page 2025
[14]

Dixin Tang et al. 2020. CrocodileDB in Action: Resource-Efficient Query Execu- tion by Exploiting Time Slackness.Proc. VLDB Endow.13, 12 (2020), 2937–2940

work page 2020
[15]

Elmore, Sanjay Krishnan, and Michael J

Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, and Michael J. Franklin. 2019. Intermittent Query Processing.Proc. VLDB Endow.12, 11 (2019), 1427–1441

work page 2019
[16]

Elmore, Sanjay Krishnan, and Michael J

Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, and Michael J. Franklin. 2020. Thrifty Query Execution via Incrementability. InACM SIGMOD. ACM, 1241–1256

work page 2020
[17]

Vinod Kumar Vavilapalli et al. 2013. Apache Hadoop YARN: yet another resource negotiator. InACM Symposium on Cloud Computing, SOCC. 5:1–5:16. https: //doi.org/10.1145/2523616.2523633

work page doi:10.1145/2523616.2523633 2013
[18]

Guolu Wang, Jungang Xu, Renfeng Liu, and Shanshan Huang. 2018. A Hard Real-time Scheduler for Spark on YARN. InIEEE/ACM CCGRID. 645–652

work page 2018
[19]

Zuozhi Wang et al. 2020. Grosbeak: A Data Warehouse Supporting Resource- Aware Incremental Computing. InACM SIGMOD. 2797–2800

work page 2020
[20]

Jie Zhu, Xiaoping Li, Rubén Ruiz, and Xiaolong Xu. 2018. Scheduling stochastic multi-stage jobs to elastic hybrid cloud resources.IEEE Transactions on Parallel and Distributed Systems29, 6 (2018), 1401–1415. 15

work page 2018

[1] [1]

Benjamin Berg, Mor Harchol-Balter, Benjamin Moseley, Weina Wang, and Justin Whitehouse. 2020. Optimal resource allocation for elastic and inelastic jobs. InProceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures. 75–87

work page 2020

[2] [2]

Saranya Chandrasekaran and S Sudarshan. 2024. Scheduling of intermittent query processing. InInternational Database Engineered Applications Symposium. Springer, 363–376

work page 2024

[3] [3]

Dazhao Cheng, Yuan Chen, Xiaobo Zhou, Daniel Gmach, and Dejan Milojicic

work page

[4] [4]

InIEEE INFOCOM

Adaptive scheduling of parallel jobs in Spark Streaming. InIEEE INFOCOM. 1–9

work page

[5] [5]

Stratos Dimopoulos, Chandra Krintz, and Rich Wolski. 2017. Justice: A Deadline- Aware, Fair-Share Resource Allocator for Implementing Multi-Analytics. InIEEE CLUSTER. 233–244

work page 2017

[6] [6]

Xiaofei Hou, TK Ashwin Kumar, Johnson P Thomas, and Hong Liu. 2017. Dy- namic deadline-constraint scheduler for Hadoop YARN. In2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDC...

work page 2017

[7] [7]

Botong Huang, Lianggui Weng, Wei Chen, Kai Zeng, Yihui Feng, Bolin Ding, Jin- gren Zhou, Zuozhi Wang, and Chen Li. 2025. Agamotto: Scheduling of Deadline- Oriented Incremental Query Execution under Uncertain Resource Price.Pro- ceedings of the VLDB Endowment18, 6 (2025), 1852–1864

work page 2025

[8] [8]

Islam, S

M. Islam, S. Karunasekera, and R. Buyya. 2017. dSpark: Deadline-Based Resource Allocation for Big Data Applications in Apache Spark. InIEEE International Conference on e-Science (e-Science). 89–98

work page 2017

[9] [9]

Zhengyu Ou, Ge Yu, Yaxin Yu, Shanshan Wu, Xiaochun Yang, and Qingxu Deng

work page

[10] [10]

InAdvances in Web-Age Inf

Tick Scheduling: A Deadline Based Optimal Task Scheduling Approach for Real-Time Data Stream Systems. InAdvances in Web-Age Inf. Management W AIM, Vol. 3739. 725–730

work page

[11] [11]

Zechao Shang et al. 2020. CrocodileDB: Efficient Database Execution through Intelligent Deferment. InConf. on Innovative Data Systems Research, CIDR

work page 2020

[12] [12]

Subhajit Sidhanta, Wojciech Golab, and Supratik Mukhopadhyay. 2016. OptEx: A deadline-aware cost optimization model for spark. In2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). IEEE, 193–202

work page 2016

[13] [13]

Daniel Sotolongo, Daniel Mills, Tyler Akidau, Anirudh Santhiar, Attila-Péter Tóth, Botong Huang, Boyuan Zhang, Igor Belianski, Ling Geng, Matt Uhlar, et al. 2025. Streaming Democratized: Ease Across the Latency Spectrum with Delayed View Semantics and Snowflake Dynamic Tables. InCompanion of the 2025 International Conference on Management of Data. 622–634

work page 2025

[14] [14]

Dixin Tang et al. 2020. CrocodileDB in Action: Resource-Efficient Query Execu- tion by Exploiting Time Slackness.Proc. VLDB Endow.13, 12 (2020), 2937–2940

work page 2020

[15] [15]

Elmore, Sanjay Krishnan, and Michael J

Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, and Michael J. Franklin. 2019. Intermittent Query Processing.Proc. VLDB Endow.12, 11 (2019), 1427–1441

work page 2019

[16] [16]

Elmore, Sanjay Krishnan, and Michael J

Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, and Michael J. Franklin. 2020. Thrifty Query Execution via Incrementability. InACM SIGMOD. ACM, 1241–1256

work page 2020

[17] [17]

Vinod Kumar Vavilapalli et al. 2013. Apache Hadoop YARN: yet another resource negotiator. InACM Symposium on Cloud Computing, SOCC. 5:1–5:16. https: //doi.org/10.1145/2523616.2523633

work page doi:10.1145/2523616.2523633 2013

[18] [18]

Guolu Wang, Jungang Xu, Renfeng Liu, and Shanshan Huang. 2018. A Hard Real-time Scheduler for Spark on YARN. InIEEE/ACM CCGRID. 645–652

work page 2018

[19] [19]

Zuozhi Wang et al. 2020. Grosbeak: A Data Warehouse Supporting Resource- Aware Incremental Computing. InACM SIGMOD. 2797–2800

work page 2020

[20] [20]

Jie Zhu, Xiaoping Li, Rubén Ruiz, and Xiaolong Xu. 2018. Scheduling stochastic multi-stage jobs to elastic hybrid cloud resources.IEEE Transactions on Parallel and Distributed Systems29, 6 (2018), 1401–1415. 15

work page 2018