pith. sign in

arxiv: 2605.08601 · v2 · pith:2A6XIJWYnew · submitted 2026-05-09 · 💻 cs.DB

Elastic Scheduling of Intermittent Query Processing in a Cluster Environment

Pith reviewed 2026-05-20 23:31 UTC · model grok-4.3

classification 💻 cs.DB
keywords elastic schedulingintermittent query processingdeadline constraintscost minimizationcluster scalingstream processingApache Spark
0
0 comments X

The pith

Elastic scheduling for batched queries in clusters meets deadlines at minimum cost by scaling nodes dynamically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops scheduling schemes for processing data streams in batches over fixed windows rather than processing tuples continuously. By scaling the number of nodes up or down in an elastic cluster, these schemes complete each batch by its deadline while keeping total resource usage low. The methods also manage several queries at once, accept new queries that arrive mid-stream, and respond to changes in data arrival rates. Tests on Spark running in a cloud service with TPC-H and Yahoo data show the elastic batch approach uses less cost than fixed node counts or standard streaming.

Core claim

The central claim is that intermittent batched processing of tuples in an elastic parallel environment can be scheduled by scaling nodes to meet deadlines after each window while minimizing overall cost, with the schemes extending to concurrent queries, new query arrivals, and input rate variations.

What carries the argument

Scheduling schemes that allocate nodes for each batch to minimize cost subject to deadline constraints in an elastic setting.

If this is right

  • Deadlines are met for window-based queries even when multiple queries run concurrently.
  • New queries can arrive and be incorporated without restarting the schedule.
  • Input rate changes are absorbed by adjusting node counts between batches.
  • Evaluations show lower cost than fixed clusters or eager Spark streaming on standard benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach suggests cloud stream analytics can run at lower hourly expense by avoiding constant over-provisioning.
  • The same batch-and-scale logic could transfer to other distributed engines that support dynamic resource allocation.
  • Better runtime prediction of batch times would make the cost savings more reliable across varying workloads.

Load-bearing premise

That processing times, scaling costs, and deadline constraints can be modeled accurately enough to allow optimization of node allocation for minimum cost without violating deadlines in real elastic environments.

What would settle it

An experiment in which the elastic scheduler misses a deadline or uses more total cost than a fixed-node baseline under realistic rate fluctuations would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.08601 by Saranya Chandrasekaran, S. Sudarshan.

Figure 1
Figure 1. Figure 1: Components of Custom Scheduler Also, for multiple streams, we assume that all the matching tuples from the input streams arrive together. During simulation for assessing the variations in the input rate that can be supported with the already generated schedule, we consider the same amount of variation in each of the input streams. For example, for assessing the maximum rate, we consider both orders and lin… view at source ↗
Figure 2
Figure 2. Figure 2: TPC-Q1 processing duration for different configu [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: TPC-Q1 processing duration (for 4500 files) for dif [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Variable Input Rate Profiles and Node Requirements [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Release of nodes experiment 9.7 Experiments with Partial Aggregation We carried out experiments with partial aggregation enabled for 0.4D and 0.3D cases with baseline input rate. The frequency of doing partial aggregation was set as 25% i.e. to carry out partial aggregation once after every 1/4th of the total number of batches are processed [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 5
Figure 5. Figure 5: Release of nodes experiment [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
read the original abstract

Many applications process a stream of tuples over a window duration, and require the results within a specified deadline after the end of the window. For such scenarios, processing tuples intermittently (in batches) instead of eagerly processing tuples as they arrive significantly reduces the overall cost. Earlier work on intermittent query processing has addressed only fixed environments. In this paper, we propose scheduling schemes for batched processing of tuples, in an elastic parallel environment, scaling nodes up or down. Our scheduling schemes ensure to meet the deadlines, while incurring minimum cost. Our schemes also handle multiple concurrent queries, the arrival of new queries, and input rate variations. We have implemented our schemes on top of Apache Spark, in the AWS EMR environment, and evaluated performance with both TPC-H and Yahoo Streaming datasets. Our experimental results show that our scheduling algorithms significantly outperform alternatives, such as using a fixed set of nodes without elasticity, or using Spark streaming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes scheduling schemes for intermittent batched query processing over windows in an elastic parallel cluster (Apache Spark on AWS EMR). The schemes dynamically scale nodes to meet deadlines at minimum cost, while also handling concurrent queries, new query arrivals, and input-rate variations. Experiments on TPC-H and Yahoo Streaming datasets report significant outperformance versus fixed-node allocations and standard Spark streaming.

Significance. If the modeling assumptions hold, the work addresses a practical gap by extending intermittent processing from fixed to elastic environments with concurrency support. The real-system implementation on Spark/AWS EMR and evaluation on standard benchmarks constitute concrete strengths that could inform cost-aware stream processing in cloud settings.

major comments (2)
  1. [§4] §4 (Scheduling Formulation): The deadline-meeting and minimum-cost claims rest on the assumption that per-tuple processing times, scaling latencies, and costs can be accurately pre-computed and remain stable; the manuscript provides no sensitivity analysis or error bounds showing robustness when these estimates deviate under real node-acquisition overheads and rate changes.
  2. [Experimental Evaluation section] Experimental Evaluation section, results tables: Reported outperformance lacks error bars, number of runs, or explicit methodology for measuring deadline violations and cost under concurrent queries and rate variations, leaving the central empirical claim difficult to verify against modeling inaccuracies.
minor comments (1)
  1. [§3] Notation for the cost and deadline constraints could be made more explicit with a single summary table of symbols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the modeling assumptions and experimental reporting. We address each major comment below and describe the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Scheduling Formulation): The deadline-meeting and minimum-cost claims rest on the assumption that per-tuple processing times, scaling latencies, and costs can be accurately pre-computed and remain stable; the manuscript provides no sensitivity analysis or error bounds showing robustness when these estimates deviate under real node-acquisition overheads and rate changes.

    Authors: We acknowledge the validity of this observation. The current formulation relies on profiled estimates for processing times and latencies. In the revised manuscript we will add a sensitivity analysis subsection to §4. This analysis will perturb per-tuple processing times and scaling latencies by ±25% and report the resulting changes in deadline compliance rates and total cost, thereby providing explicit error bounds under realistic estimation inaccuracies. revision: yes

  2. Referee: Experimental Evaluation section, results tables: Reported outperformance lacks error bars, number of runs, or explicit methodology for measuring deadline violations and cost under concurrent queries and rate variations, leaving the central empirical claim difficult to verify against modeling inaccuracies.

    Authors: We agree that additional statistical detail is needed. We will revise the Experimental Evaluation section to report results averaged over 10 independent runs, including error bars for standard deviation. We will also add explicit methodology text describing how deadline violations are counted (fraction of windows whose completion exceeds the deadline) and how cost is aggregated under concurrent queries and input-rate changes. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical validation of scheduling schemes via implementation and external datasets

full rationale

The paper describes scheduling schemes for intermittent batched query processing in elastic clusters, with claims of deadline compliance at minimum cost supported by direct implementation on Apache Spark in AWS EMR and performance evaluation against TPC-H and Yahoo Streaming datasets. No derivation chain, equations, or first-principles results are presented that reduce by construction to fitted inputs, self-definitions, or self-citations; the outperformance claims rest on experimental comparisons to fixed-node and Spark streaming baselines rather than any internal fitting or renaming of known results. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described. The approach implicitly relies on standard assumptions about cloud resource costs and processing predictability, but these cannot be audited without the full text.

pith-pipeline@v0.9.0 · 5683 in / 1093 out tokens · 36873 ms · 2026-05-20T23:31:19.903208+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    Benjamin Berg, Mor Harchol-Balter, Benjamin Moseley, Weina Wang, and Justin Whitehouse. 2020. Optimal resource allocation for elastic and inelastic jobs. InProceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures. 75–87

  2. [2]

    Saranya Chandrasekaran and S Sudarshan. 2024. Scheduling of intermittent query processing. InInternational Database Engineered Applications Symposium. Springer, 363–376

  3. [3]

    Dazhao Cheng, Yuan Chen, Xiaobo Zhou, Daniel Gmach, and Dejan Milojicic

  4. [4]

    InIEEE INFOCOM

    Adaptive scheduling of parallel jobs in Spark Streaming. InIEEE INFOCOM. 1–9

  5. [5]

    Stratos Dimopoulos, Chandra Krintz, and Rich Wolski. 2017. Justice: A Deadline- Aware, Fair-Share Resource Allocator for Implementing Multi-Analytics. InIEEE CLUSTER. 233–244

  6. [6]

    Xiaofei Hou, TK Ashwin Kumar, Johnson P Thomas, and Hong Liu. 2017. Dy- namic deadline-constraint scheduler for Hadoop YARN. In2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDC...

  7. [7]

    Botong Huang, Lianggui Weng, Wei Chen, Kai Zeng, Yihui Feng, Bolin Ding, Jin- gren Zhou, Zuozhi Wang, and Chen Li. 2025. Agamotto: Scheduling of Deadline- Oriented Incremental Query Execution under Uncertain Resource Price.Pro- ceedings of the VLDB Endowment18, 6 (2025), 1852–1864

  8. [8]

    Islam, S

    M. Islam, S. Karunasekera, and R. Buyya. 2017. dSpark: Deadline-Based Resource Allocation for Big Data Applications in Apache Spark. InIEEE International Conference on e-Science (e-Science). 89–98

  9. [9]

    Zhengyu Ou, Ge Yu, Yaxin Yu, Shanshan Wu, Xiaochun Yang, and Qingxu Deng

  10. [10]

    InAdvances in Web-Age Inf

    Tick Scheduling: A Deadline Based Optimal Task Scheduling Approach for Real-Time Data Stream Systems. InAdvances in Web-Age Inf. Management W AIM, Vol. 3739. 725–730

  11. [11]

    Zechao Shang et al. 2020. CrocodileDB: Efficient Database Execution through Intelligent Deferment. InConf. on Innovative Data Systems Research, CIDR

  12. [12]

    Subhajit Sidhanta, Wojciech Golab, and Supratik Mukhopadhyay. 2016. OptEx: A deadline-aware cost optimization model for spark. In2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). IEEE, 193–202

  13. [13]

    Daniel Sotolongo, Daniel Mills, Tyler Akidau, Anirudh Santhiar, Attila-Péter Tóth, Botong Huang, Boyuan Zhang, Igor Belianski, Ling Geng, Matt Uhlar, et al. 2025. Streaming Democratized: Ease Across the Latency Spectrum with Delayed View Semantics and Snowflake Dynamic Tables. InCompanion of the 2025 International Conference on Management of Data. 622–634

  14. [14]

    Dixin Tang et al. 2020. CrocodileDB in Action: Resource-Efficient Query Execu- tion by Exploiting Time Slackness.Proc. VLDB Endow.13, 12 (2020), 2937–2940

  15. [15]

    Elmore, Sanjay Krishnan, and Michael J

    Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, and Michael J. Franklin. 2019. Intermittent Query Processing.Proc. VLDB Endow.12, 11 (2019), 1427–1441

  16. [16]

    Elmore, Sanjay Krishnan, and Michael J

    Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, and Michael J. Franklin. 2020. Thrifty Query Execution via Incrementability. InACM SIGMOD. ACM, 1241–1256

  17. [17]

    Vinod Kumar Vavilapalli et al. 2013. Apache Hadoop YARN: yet another resource negotiator. InACM Symposium on Cloud Computing, SOCC. 5:1–5:16. https: //doi.org/10.1145/2523616.2523633

  18. [18]

    Guolu Wang, Jungang Xu, Renfeng Liu, and Shanshan Huang. 2018. A Hard Real-time Scheduler for Spark on YARN. InIEEE/ACM CCGRID. 645–652

  19. [19]

    Zuozhi Wang et al. 2020. Grosbeak: A Data Warehouse Supporting Resource- Aware Incremental Computing. InACM SIGMOD. 2797–2800

  20. [20]

    Jie Zhu, Xiaoping Li, Rubén Ruiz, and Xiaolong Xu. 2018. Scheduling stochastic multi-stage jobs to elastic hybrid cloud resources.IEEE Transactions on Parallel and Distributed Systems29, 6 (2018), 1401–1415. 15