Recognition: unknown
Decentralized Stratified Sampling for Low-Latency Approximate Geospatial Data Stream Processing in Edge-Cloud Architectures
Pith reviewed 2026-05-09 15:59 UTC · model grok-4.3
The pith
Decentralized geohash-stratified sampling at edge nodes enables faster geospatial data stream processing with controlled approximation errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that its decentralized stratified sampling algorithm, based on geohash partitioning and executed locally at each edge node, delivers significant reductions in processing latency for geospatial streams compared to centralized cloud approaches, all while holding approximation errors to less than ten percent mean absolute percentage error at an eighty percent sampling rate, and that coarser geohash levels can further lower those errors by thirty percent.
What carries the argument
A decentralized geohash-stratified sampling procedure that operates independently on each edge node to select spatially representative data points without requiring any coordination among the nodes.
Load-bearing premise
The premise that geohash-based stratification performed independently at each edge node will produce samples that are sufficiently representative of the global spatial distribution for accurate downstream analytics.
What would settle it
Finding that the mean absolute percentage error for aggregate queries on a new clustered geospatial dataset exceeds ten percent when using the eighty percent sampling rate would falsify the maintained accuracy claim.
read the original abstract
The exponential growth of geospatial data streams flowing from IoT devices challenges conventional cloud-based analytics, which typically suffer from network bandwidth waste and latency, basically attributed to the data being managed completely by Cloud, such as centralized sampling. To address this gap, we propose EdgeApproxGeo, a novel edge-cloud architecture that performs spatial-stratified online sampling at network edge devices near data sources. Our system introduces a novel sampling method called EdgeSOS, which is a unique decentralized, geohash-based stratified sampling algorithm designed to operate independently at resource-constrained edge nodes without cross-node synchronization, coupled with spatial-aware data distribution and topic routing in Apache Kafka data stream ingestion, aiming at optimizing downstream data stream processing analytics. We evaluated our system on two real-world geo-referenced datasets, mobility and air quality, and EdgeApproxGeo achieves a significant speedup over cloud-only baselines while maintaining errors in check (e.g., MAPE < 10% error rate at 80% sampling rate). We further demonstrate that coarser geohash granularity (e.g., Geohash-5) can reduce error figures by 30% as compared to finer counterparts (i.e., Geohash-6), thus revealing a tunable accuracy-efficiency trade-off. Our standard-compliant prototype, built atop Apache Kafka and Apache Spark, further validates the utility of edge-deployed approximate query processing for real-time big geospatial data analytics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents EdgeApproxGeo, an edge-cloud architecture for approximate geospatial data stream processing. It introduces EdgeSOS, a decentralized geohash-based stratified sampling algorithm that operates independently at resource-constrained edge nodes without cross-node synchronization, integrated with Apache Kafka for spatial-aware routing. Evaluation on mobility and air quality datasets claims significant speedup over cloud-only baselines with MAPE below 10% at 80% sampling rate, plus a 30% error reduction using coarser geohash granularity (Geohash-5 vs. Geohash-6), supported by a Kafka-Spark prototype.
Significance. If the performance claims hold under the stated conditions, the work could contribute to low-latency edge analytics for IoT geospatial streams by reducing cloud bandwidth usage. Positive elements include evaluation on two real-world datasets and a standards-compliant prototype implementation. The significance is limited by the central reliance on an untested assumption about spatial representativeness in the absence of synchronization.
major comments (2)
- [EdgeSOS algorithm description] EdgeSOS algorithm (description in the methods section): the claim of maintaining MAPE <10% at 80% sampling with independent per-node geohash stratification without synchronization or count exchange is load-bearing for both the speedup and accuracy results. This setup implicitly assumes local stratum sizes are balanced or imbalances average out globally, but no analysis, proof, or experiments under uneven cross-node geohash distributions (common in mobility data) are provided to support this.
- [Evaluation] Evaluation section: the reported metrics (MAPE <10% at 80% sampling, 30% error reduction with Geohash-5) are presented without details on data partitioning across nodes, number of edge nodes, exact baseline implementations, variance across runs, or sensitivity to distribution skew. This undermines verification of whether the no-synchronization assumption holds for the tested datasets.
minor comments (2)
- [Abstract] Abstract: the sentence 'basically attributed to the data being managed completely by Cloud, such as centralized sampling' is imprecise and should be clarified to distinguish between full data transfer and sampling strategies.
- [Methods] The manuscript would benefit from explicit pseudocode or a small example illustrating how EdgeSOS combines local samples into a global approximate result without synchronization.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, acknowledging where additional details and analysis are needed, and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: [EdgeSOS algorithm description] EdgeSOS algorithm (description in the methods section): the claim of maintaining MAPE <10% at 80% sampling with independent per-node geohash stratification without synchronization or count exchange is load-bearing for both the speedup and accuracy results. This setup implicitly assumes local stratum sizes are balanced or imbalances average out globally, but no analysis, proof, or experiments under uneven cross-node geohash distributions (common in mobility data) are provided to support this.
Authors: We agree that the no-synchronization assumption requires stronger support. The mobility dataset used in our evaluation naturally exhibits spatial clustering and uneven geohash distributions across locations, which implicitly tests the approach under realistic skew; the reported MAPE <10% at 80% sampling indicates that local proportional sampling per geohash maintains sufficient global representativeness on average. However, to directly address the concern, the revised manuscript will add a new subsection with analysis of cross-node imbalance, including synthetic experiments that vary the degree of geohash distribution skew across nodes and quantify the resulting MAPE deviation. revision: yes
-
Referee: [Evaluation] Evaluation section: the reported metrics (MAPE <10% at 80% sampling, 30% error reduction with Geohash-5) are presented without details on data partitioning across nodes, number of edge nodes, exact baseline implementations, variance across runs, or sensitivity to distribution skew. This undermines verification of whether the no-synchronization assumption holds for the tested datasets.
Authors: We acknowledge that the current evaluation section omits key implementation and experimental details needed for verification. In the revised manuscript, we will expand the evaluation to explicitly state the number of edge nodes (10), the data partitioning method (geographic partitioning by geohash prefix to emulate edge deployments), the precise baseline configurations (cloud-only full-data processing and centralized sampling), standard deviations across five independent runs, and additional sensitivity results under controlled distribution skew. These changes will enable direct assessment of the no-synchronization assumption on the tested datasets. revision: yes
Circularity Check
No significant circularity; claims rest on external empirical evaluation
full rationale
The paper introduces EdgeApproxGeo and the EdgeSOS sampling algorithm as a novel decentralized method, with all performance claims (speedup over baselines, MAPE <10% at 80% sampling, 30% error reduction from coarser geohash) supported by direct measurements on two independent real-world datasets (mobility and air quality). No equations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the provided text that would reduce any result to its own inputs by construction. The evaluation uses external benchmarks rather than synthetic or self-generated data, rendering the derivation chain self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Geohash cells provide effective spatial stratification that preserves representativeness for downstream queries
- domain assumption Decentralized operation at edge nodes can be performed without cross-node communication or synchronization
invented entities (1)
-
EdgeSOS algorithm
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Kandiboina, R., Knickerbocker, S., Bhagat, S., Hawkins, N., Sharma, A.: Explor- ing the efficacy of large-scale connected vehicle data in real-time traffic applica- tions. Transportation Research Record2678(5), 651–665 (2024) https://doi.org/ 10.1177/03611981231191512 https://doi.org/10.1177/03611981231191512
-
[2]
Journal of Transport Geography115, 103815 (2024) https: //doi.org/10.1016/j.jtrangeo.2024.103815
Ghaffarpasand, O., Pope, F.D.: Telematics data for geospatial and temporal mapping of urban mobility: New insights into travel characteristics and vehi- cle specific power. Journal of Transport Geography115, 103815 (2024) https: //doi.org/10.1016/j.jtrangeo.2024.103815
-
[3]
In: 2025 IEEE International Conference on Smart Computing (SMARTCOMP), pp
Colarusso, C., Falco, I., Zimeo, E.: Fog-cloud interpolation of urban monitor- ing data collected by lorawan networks. In: 2025 IEEE International Conference on Smart Computing (SMARTCOMP), pp. 146–153 (2025). https://doi.org/10. 1109/SMARTCOMP65954.2025.00064
-
[4]
Falco, I., Colarusso, C., Zimeo, E.: U-track: A mobile sensing fog platform for real-time urban monitoring. In: 2025 21st International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT), pp. 665– 672 (2025). https://doi.org/10.1109/DCOSS-IoT65416.2025.00104
-
[5]
In: GLOBECOM 2020-2020 IEEE Global Communications Conference, pp
Al Jawarneh, I.M., Bellavista, P., Corradi, A., Foschini, L., Montanari, R.: Locality-preserving spatial partitioning for geo big data analytics in main mem- ory frameworks. In: GLOBECOM 2020-2020 IEEE Global Communications Conference, pp. 1–6 (2020). IEEE
2020
-
[6]
Al Jawarneh, I.M., Bellavista, P., Corradi, A., Foschini, L., Montanari, R.: Effi- cient QoS-aware spatial join processing for scalable NoSQL storage frameworks. IEEE Transactions on Network and Service Management18(2), 2437–2449 (2021) https://doi.org/10.1109/TNSM.2020.3034150
-
[7]
Combining spectral and textural information in uav hyperspectral images to estimate rice grain yield
Huang, X., Wang, S., Lu, T., Liu, Y., Serrano-Estrada, L.: Crowdsourced geospa- tial data is reshaping urban sciences. International Journal of Applied Earth Observation and Geoinformation127, 103687 (2024) https://doi.org/10.1016/j. jag.2024.103687
work page doi:10.1016/j 2024
-
[8]
IEEE Access7, 25123–25137 (2019)
Wei, X., Liu, Y., Wang, X., Gao, S., Chen, L.: Online adaptive approximate stream processing with customized error control. IEEE Access7, 25123–25137 (2019)
2019
-
[9]
Data Science and Engineering3(4), 379–397 (2018)
Li, K., Li, G.: Approximate query processing: What is new and where to go? a survey on approximate query processing. Data Science and Engineering3(4), 379–397 (2018)
2018
-
[10]
In: 2019 IEEE Global Communications 41 Conference (GLOBECOM), pp
Al Jawarneh, I.M., Bellavista, P., Foschini, L., Montanari, R.: Spatial-aware approximate big data stream processing. In: 2019 IEEE Global Communications 41 Conference (GLOBECOM), pp. 1–6 (2019). IEEE
2019
-
[11]
In: 2020 IEEE 25th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), pp
Al Jawarneh, I.M., Bellavista, P., Corradi, A., Foschini, L., Montanari, R.: Spatially representative online big data sampling for smart cities. In: 2020 IEEE 25th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), pp. 1–6 (2020). IEEE
2020
-
[12]
IEEE Transactions on Parallel and Distributed Systems35(1), 73–88 (2023)
Al Jawarneh, I.M., Bellavista, P., Corradi, A., Foschini, L., Montanari, R.: Spa- tialSSJP: Qos-aware adaptive approximate stream-static spatial join processor. IEEE Transactions on Parallel and Distributed Systems35(1), 73–88 (2023)
2023
-
[13]
In: 2019 USENIX Annual Technical Conference (USENIX ATC 19), pp
Fu, X., Ghaffar, T., Davis, J.C., Lee, D.:{EdgeWise}: A better stream processing engine for the edge. In: 2019 USENIX Annual Technical Conference (USENIX ATC 19), pp. 929–946 (2019)
2019
-
[14]
IEEE Transactions on Mobile Computing (2025)
Ching, C.-W., Chen, X., Kim, C., Wang, T., Chen, D., Da Silva, D., Hu, L.: AgileDART: An agile and scalable edge stream processing engine. IEEE Transactions on Mobile Computing (2025)
2025
-
[15]
In: 2021 IEEE/ACM Symposium on Edge Computing (SEC), pp
Kumar, D., Ahmad, S., Chandra, A., Sitaraman, R.K.: Aggnet: Cost-aware aggre- gation networks for geo-distributed streaming analytics. In: 2021 IEEE/ACM Symposium on Edge Computing (SEC), pp. 297–311 (2021). IEEE
2021
-
[16]
Computer Communications183, 121–135 (2022)
Zhang, D.-g., Ni, C.-h., Zhang, J., Zhang, T., Yang, P., Wang, J.-x., Yan, H.- r.: A novel edge computing architecture based on adaptive stratified sampling. Computer Communications183, 121–135 (2022)
2022
-
[17]
IEEE Internet of Things Journal11(18), 30094–30113 (2024)
Huang, W., Deng, X.: GeoEkuiper: A cloud-cooperated geospatial edge stream processing engine for resource-constrained iot devices with higher throughput. IEEE Internet of Things Journal11(18), 30094–30113 (2024)
2024
-
[18]
Sensors 21(12), 4160 (2021)
Al Jawarneh, I.M., Bellavista, P., Corradi, A., Foschini, L., Montanari, R.: QoS- aware approximate query processing for smart cities spatial data streams. Sensors 21(12), 4160 (2021)
2021
-
[19]
Thompson, S.K.: Sampling vol. 755. John Wiley & Sons, NJ, USA (2012)
2012
-
[20]
In: Companion of the 2025 International Conference on Management of Data, pp
Michalke, A., Lepping, A., Markl, V., Martinez, R., Schubert, N., Schwerdtfeger, L., Tekdogan, T., Zeuch, S., Ziehn, A., Falkensteiner, C.,et al.: NebulaStream: An extensible, high-performance streaming engine for multi-modal edge applications. In: Companion of the 2025 International Conference on Management of Data, pp. 195–198 (2025)
2025
-
[21]
In: Proceedings of the 18th ACM International Conference on Distributed and Event-based Systems, pp
Markl, V.: NebulaStream-data stream processing in massively distributed, het- erogeneous, volatile environments. In: Proceedings of the 18th ACM International Conference on Distributed and Event-based Systems, pp. 1–3 (2024) 42
2024
-
[22]
Sustainability12(23), 10175 (2020)
Abdullah, F., Peng, L., Tak, B.: Fossel: efficient latency reduction in approximat- ing streaming sensor data. Sustainability12(23), 10175 (2020)
2020
-
[23]
In: 2020 IEEE/ACM Symposium on Edge Computing (SEC), pp
Wolfrath, J., Chandra, A.: Poster: Data-aware edge sampling for aggregate query approximation. In: 2020 IEEE/ACM Symposium on Edge Computing (SEC), pp. 158–160 (2020). IEEE
2020
-
[24]
In: 2020 11th International Conference on Network of the Future (NoF), pp
Hafeez, T., McArdle, G., Xu, L.: Adaptive window based sampling on the edge for internet of things data streams. In: 2020 11th International Conference on Network of the Future (NoF), pp. 105–109 (2020). IEEE
2020
-
[25]
In: 2022 IEEE International Conference on Cloud Engineering (IC2E), pp
Wolfrath, J., Chandra, A.: Efficient transmission and reconstruction of dependent data streams via edge sampling. In: 2022 IEEE International Conference on Cloud Engineering (IC2E), pp. 47–57 (2022). IEEE
2022
-
[26]
In: 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing (UCC), pp
Georgiou, Z., Symeonides, M., Trihinas, D., Pallis, G., Dikaiakos, M.D.: Stream- sight: A query-driven framework for streaming analytics in edge computing. In: 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing (UCC), pp. 143–152 (2018). IEEE
2018
-
[27]
IEEE Transactions on Cloud Computing10(2), 1462–1475 (2020)
Wu, G., Yun, X., Wang, Y., Wang, S., Li, B., Liu, Y.: A sketching approach for obtaining real-time statistics over data streams in cloud. IEEE Transactions on Cloud Computing10(2), 1462–1475 (2020)
2020
-
[28]
Computer Communications164, 77–87 (2020)
Alencar, B.M., Rios, R.A., Santana, C., Prazeres, C.: FoT-Stream: A fog platform for data stream analytics in iot. Computer Communications164, 77–87 (2020)
2020
-
[29]
ACM Transac- tions on Multimedia Computing, Communications and Applications21(8), 1–15 (2025)
Ye, P.-G., Wang, W., Mi, B., Chen, K.: EdgeStreaming: Secure computation intelligence in distributed edge networks for streaming analytics. ACM Transac- tions on Multimedia Computing, Communications and Applications21(8), 1–15 (2025)
2025
-
[30]
IEEE Transactions on Cloud Computing10(2), 792–804 (2020)
Ali, M., Anjum, A., Rana, O., Zamani, A.R., Balouek-Thomert, D., Parashar, M.: RES: Real-time video stream analytics using edge enhanced clouds. IEEE Transactions on Cloud Computing10(2), 792–804 (2020)
2020
-
[31]
In: Proceedings of the NetDB, vol
Kreps, J., Narkhede, N., Rao, J.,et al.: Kafka: A distributed messaging system for log processing. In: Proceedings of the NetDB, vol. 11, pp. 1–7 (2011). Athens, Greece
2011
-
[32]
In: Proceedings of the 2018 International Conference on Management of Data
Armbrust, M., Das, T., Torres, J., Yavuz, B., Zhu, S., Xin, R., Ghodsi, A., Stoica, I., Zaharia, M.: Structured streaming: A declarative API for real-time appli- cations in apache spark. In: Proceedings of the 2018 International Conference on Management of Data. SIGMOD ’18, pp. 601–613. Association for Comput- ing Machinery, New York, NY, USA (2018). http...
-
[33]
In: 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10) (2010)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: Cluster computing with working sets. In: 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10) (2010)
2010
-
[34]
Communications of the ACM59(11), 56–65 (2016)
Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J.,et al.: Apache spark: a unified engine for big data processing. Communications of the ACM59(11), 56–65 (2016)
2016
-
[35]
In: The 25th Annual International Conference on Mobile Computing and Networking, pp
Wang, G., Chen, X., Zhang, F., Wang, Y., Zhang, D.: Experience: Understanding long-term evolving patterns of shared electric vehicle networks. In: The 25th Annual International Conference on Mobile Computing and Networking, pp. 1–12 (2019)
2019
-
[36]
In: 2022 21st ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pp
Daepp, M.I., Cabral, A., Ranganathan, V., Iyer, V., Counts, S., Johns, P., Roseway, A., Catlett, C., Jancke, G., Gehring, D.,et al.: Eclipse: An end-to- end platform for low-cost, hyperlocal environmental sensing in cities. In: 2022 21st ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pp. 28–40 (2022). IEEE 44
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.