Elastic Scheduling of Intermittent Query Processing in a Cluster Environment
Pith reviewed 2026-05-20 23:31 UTC · model grok-4.3
The pith
Elastic scheduling for batched queries in clusters meets deadlines at minimum cost by scaling nodes dynamically.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that intermittent batched processing of tuples in an elastic parallel environment can be scheduled by scaling nodes to meet deadlines after each window while minimizing overall cost, with the schemes extending to concurrent queries, new query arrivals, and input rate variations.
What carries the argument
Scheduling schemes that allocate nodes for each batch to minimize cost subject to deadline constraints in an elastic setting.
If this is right
- Deadlines are met for window-based queries even when multiple queries run concurrently.
- New queries can arrive and be incorporated without restarting the schedule.
- Input rate changes are absorbed by adjusting node counts between batches.
- Evaluations show lower cost than fixed clusters or eager Spark streaming on standard benchmarks.
Where Pith is reading between the lines
- This approach suggests cloud stream analytics can run at lower hourly expense by avoiding constant over-provisioning.
- The same batch-and-scale logic could transfer to other distributed engines that support dynamic resource allocation.
- Better runtime prediction of batch times would make the cost savings more reliable across varying workloads.
Load-bearing premise
That processing times, scaling costs, and deadline constraints can be modeled accurately enough to allow optimization of node allocation for minimum cost without violating deadlines in real elastic environments.
What would settle it
An experiment in which the elastic scheduler misses a deadline or uses more total cost than a fixed-node baseline under realistic rate fluctuations would falsify the central claim.
Figures
read the original abstract
Many applications process a stream of tuples over a window duration, and require the results within a specified deadline after the end of the window. For such scenarios, processing tuples intermittently (in batches) instead of eagerly processing tuples as they arrive significantly reduces the overall cost. Earlier work on intermittent query processing has addressed only fixed environments. In this paper, we propose scheduling schemes for batched processing of tuples, in an elastic parallel environment, scaling nodes up or down. Our scheduling schemes ensure to meet the deadlines, while incurring minimum cost. Our schemes also handle multiple concurrent queries, the arrival of new queries, and input rate variations. We have implemented our schemes on top of Apache Spark, in the AWS EMR environment, and evaluated performance with both TPC-H and Yahoo Streaming datasets. Our experimental results show that our scheduling algorithms significantly outperform alternatives, such as using a fixed set of nodes without elasticity, or using Spark streaming.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes scheduling schemes for intermittent batched query processing over windows in an elastic parallel cluster (Apache Spark on AWS EMR). The schemes dynamically scale nodes to meet deadlines at minimum cost, while also handling concurrent queries, new query arrivals, and input-rate variations. Experiments on TPC-H and Yahoo Streaming datasets report significant outperformance versus fixed-node allocations and standard Spark streaming.
Significance. If the modeling assumptions hold, the work addresses a practical gap by extending intermittent processing from fixed to elastic environments with concurrency support. The real-system implementation on Spark/AWS EMR and evaluation on standard benchmarks constitute concrete strengths that could inform cost-aware stream processing in cloud settings.
major comments (2)
- [§4] §4 (Scheduling Formulation): The deadline-meeting and minimum-cost claims rest on the assumption that per-tuple processing times, scaling latencies, and costs can be accurately pre-computed and remain stable; the manuscript provides no sensitivity analysis or error bounds showing robustness when these estimates deviate under real node-acquisition overheads and rate changes.
- [Experimental Evaluation section] Experimental Evaluation section, results tables: Reported outperformance lacks error bars, number of runs, or explicit methodology for measuring deadline violations and cost under concurrent queries and rate variations, leaving the central empirical claim difficult to verify against modeling inaccuracies.
minor comments (1)
- [§3] Notation for the cost and deadline constraints could be made more explicit with a single summary table of symbols.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the modeling assumptions and experimental reporting. We address each major comment below and describe the revisions planned for the next version of the manuscript.
read point-by-point responses
-
Referee: [§4] §4 (Scheduling Formulation): The deadline-meeting and minimum-cost claims rest on the assumption that per-tuple processing times, scaling latencies, and costs can be accurately pre-computed and remain stable; the manuscript provides no sensitivity analysis or error bounds showing robustness when these estimates deviate under real node-acquisition overheads and rate changes.
Authors: We acknowledge the validity of this observation. The current formulation relies on profiled estimates for processing times and latencies. In the revised manuscript we will add a sensitivity analysis subsection to §4. This analysis will perturb per-tuple processing times and scaling latencies by ±25% and report the resulting changes in deadline compliance rates and total cost, thereby providing explicit error bounds under realistic estimation inaccuracies. revision: yes
-
Referee: Experimental Evaluation section, results tables: Reported outperformance lacks error bars, number of runs, or explicit methodology for measuring deadline violations and cost under concurrent queries and rate variations, leaving the central empirical claim difficult to verify against modeling inaccuracies.
Authors: We agree that additional statistical detail is needed. We will revise the Experimental Evaluation section to report results averaged over 10 independent runs, including error bars for standard deviation. We will also add explicit methodology text describing how deadline violations are counted (fraction of windows whose completion exceeds the deadline) and how cost is aggregated under concurrent queries and input-rate changes. revision: yes
Circularity Check
No circularity: empirical validation of scheduling schemes via implementation and external datasets
full rationale
The paper describes scheduling schemes for intermittent batched query processing in elastic clusters, with claims of deadline compliance at minimum cost supported by direct implementation on Apache Spark in AWS EMR and performance evaluation against TPC-H and Yahoo Streaming datasets. No derivation chain, equations, or first-principles results are presented that reduce by construction to fitted inputs, self-definitions, or self-citations; the outperformance claims rest on experimental comparisons to fixed-node and Spark streaming baselines rather than any internal fitting or renaming of known results. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose scheduling schemes for batched processing of tuples, in an elastic parallel environment, scaling nodes up or down. Our scheduling schemes ensure to meet the deadlines, while incurring minimum cost.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The cost model in terms of cost per tuple and overhead cost is given in equation 2 TP = ((1−P) + P/Np) ∗ Nt ∗ CPT + ON + OX
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Benjamin Berg, Mor Harchol-Balter, Benjamin Moseley, Weina Wang, and Justin Whitehouse. 2020. Optimal resource allocation for elastic and inelastic jobs. InProceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures. 75–87
work page 2020
-
[2]
Saranya Chandrasekaran and S Sudarshan. 2024. Scheduling of intermittent query processing. InInternational Database Engineered Applications Symposium. Springer, 363–376
work page 2024
-
[3]
Dazhao Cheng, Yuan Chen, Xiaobo Zhou, Daniel Gmach, and Dejan Milojicic
-
[4]
Adaptive scheduling of parallel jobs in Spark Streaming. InIEEE INFOCOM. 1–9
-
[5]
Stratos Dimopoulos, Chandra Krintz, and Rich Wolski. 2017. Justice: A Deadline- Aware, Fair-Share Resource Allocator for Implementing Multi-Analytics. InIEEE CLUSTER. 233–244
work page 2017
-
[6]
Xiaofei Hou, TK Ashwin Kumar, Johnson P Thomas, and Hong Liu. 2017. Dy- namic deadline-constraint scheduler for Hadoop YARN. In2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDC...
work page 2017
-
[7]
Botong Huang, Lianggui Weng, Wei Chen, Kai Zeng, Yihui Feng, Bolin Ding, Jin- gren Zhou, Zuozhi Wang, and Chen Li. 2025. Agamotto: Scheduling of Deadline- Oriented Incremental Query Execution under Uncertain Resource Price.Pro- ceedings of the VLDB Endowment18, 6 (2025), 1852–1864
work page 2025
- [8]
-
[9]
Zhengyu Ou, Ge Yu, Yaxin Yu, Shanshan Wu, Xiaochun Yang, and Qingxu Deng
-
[10]
Tick Scheduling: A Deadline Based Optimal Task Scheduling Approach for Real-Time Data Stream Systems. InAdvances in Web-Age Inf. Management W AIM, Vol. 3739. 725–730
-
[11]
Zechao Shang et al. 2020. CrocodileDB: Efficient Database Execution through Intelligent Deferment. InConf. on Innovative Data Systems Research, CIDR
work page 2020
-
[12]
Subhajit Sidhanta, Wojciech Golab, and Supratik Mukhopadhyay. 2016. OptEx: A deadline-aware cost optimization model for spark. In2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). IEEE, 193–202
work page 2016
-
[13]
Daniel Sotolongo, Daniel Mills, Tyler Akidau, Anirudh Santhiar, Attila-Péter Tóth, Botong Huang, Boyuan Zhang, Igor Belianski, Ling Geng, Matt Uhlar, et al. 2025. Streaming Democratized: Ease Across the Latency Spectrum with Delayed View Semantics and Snowflake Dynamic Tables. InCompanion of the 2025 International Conference on Management of Data. 622–634
work page 2025
-
[14]
Dixin Tang et al. 2020. CrocodileDB in Action: Resource-Efficient Query Execu- tion by Exploiting Time Slackness.Proc. VLDB Endow.13, 12 (2020), 2937–2940
work page 2020
-
[15]
Elmore, Sanjay Krishnan, and Michael J
Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, and Michael J. Franklin. 2019. Intermittent Query Processing.Proc. VLDB Endow.12, 11 (2019), 1427–1441
work page 2019
-
[16]
Elmore, Sanjay Krishnan, and Michael J
Dixin Tang, Zechao Shang, Aaron J. Elmore, Sanjay Krishnan, and Michael J. Franklin. 2020. Thrifty Query Execution via Incrementability. InACM SIGMOD. ACM, 1241–1256
work page 2020
-
[17]
Vinod Kumar Vavilapalli et al. 2013. Apache Hadoop YARN: yet another resource negotiator. InACM Symposium on Cloud Computing, SOCC. 5:1–5:16. https: //doi.org/10.1145/2523616.2523633
-
[18]
Guolu Wang, Jungang Xu, Renfeng Liu, and Shanshan Huang. 2018. A Hard Real-time Scheduler for Spark on YARN. InIEEE/ACM CCGRID. 645–652
work page 2018
-
[19]
Zuozhi Wang et al. 2020. Grosbeak: A Data Warehouse Supporting Resource- Aware Incremental Computing. InACM SIGMOD. 2797–2800
work page 2020
-
[20]
Jie Zhu, Xiaoping Li, Rubén Ruiz, and Xiaolong Xu. 2018. Scheduling stochastic multi-stage jobs to elastic hybrid cloud resources.IEEE Transactions on Parallel and Distributed Systems29, 6 (2018), 1401–1415. 15
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.