pith. sign in

arxiv: 1907.03736 · v2 · pith:2ZKENKIEnew · submitted 2019-07-08 · 💻 cs.DB

LocationSpark: In-memory Distributed Spatial Query Processing and Optimization

Pith reviewed 2026-05-25 00:42 UTC · model grok-4.3

classification 💻 cs.DB
keywords spatial query processingdistributed systemsquery optimizationbitmap filtersquery skewin-memory computingSparkspatial indexing
0
0 comments X

The pith

A distributed query scheduler with a new cost model and bitmap filters lets spatial queries run up to an order of magnitude faster inside Spark by reducing skew effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LocationSpark as an extension inside Spark that adds a scheduler to handle query skew common in spatial workloads. The scheduler relies on a cost model to build execution plans and on bitmap filters to route each query to the right nodes, after which each node runs its own local optimization. This combination is meant to cut both skew-induced imbalance and communication volume. If the approach works, large spatial datasets from mapping or location services could be queried at scale without the slowdowns seen in prior distributed engines. The authors prototype the full set of techniques and test them on real data to show the claimed gains.

Core claim

LocationSpark shows that a distributed query scheduler using a new cost model to generate skew-minimizing plans, together with bitmap-filter spatial indexes that forward queries to the correct local nodes, enables each node to select its own best execution plan; when implemented in Spark this produces up to an order of magnitude faster spatial query processing than existing in-memory and distributed spatial systems.

What carries the argument

The distributed query scheduler that applies a new cost model to plan executions and bitmap filters to forward queries, plus per-node local optimization based on local indexes.

If this is right

  • Query skew effects are reduced by the scheduler's cost-model plans.
  • Communication costs drop because bitmap filters limit unnecessary data movement.
  • Each local node independently chooses the best plan for its data and query type.
  • The full set of techniques runs inside the Spark execution engine.
  • Real-dataset experiments confirm up to 10x gains over prior distributed spatial engines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scheduler-plus-filter pattern could be ported to other in-memory engines that already support spatial indexes.
  • Dynamic adjustment of the cost model might handle streaming spatial data where skew changes over time.
  • The routing mechanism could be combined with existing spatial partitioning methods to further lower cross-node traffic.
  • Similar bitmap-filter forwarding might help non-spatial workloads that also suffer from data skew, such as graph traversals.

Load-bearing premise

The new cost model and bitmap-filter forwarding correctly identify and mitigate query skew without introducing offsetting overheads or incorrect routing decisions.

What would settle it

Run identical high-skew spatial range and join queries on the same real datasets using both LocationSpark and an existing system such as GeoSpark, then measure whether the observed runtime ratio stays near 10x or falls below 2x once scheduler and filter overhead is included.

Figures

Figures reproduced from arXiv: 1907.03736 by Ahmed R. Mahmood, Mingjie Tang, Mourad Ouzzani, Qutaibah M. Malluhi, Walid G. Aref, Yongyang Yu.

Figure 1
Figure 1. Figure 1: Illustration of spatial join operators. Circles cen [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: LOCATIONSPARK system architecture 2.2 Overview of In-memory Distributed Spatial Query Processing To facilitate spatial query processing, we build a dis￾tributed spatial index for in-memory spatial data. Given a spatial dataset D, we obtain samples from D and construct a spatial index (e.g., an R-tree) with N leaves over the samples. We refer to this index on the sam￾ple data as the global spatial index. Ne… view at source ↗
Figure 3
Figure 3. Figure 3: Execution plan for spatial range join. The red lines [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Evaluation of local spatial join algorithms [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: a gives the performance of the specialized kNN join approaches within a local computation node when varying k from 10 to 150. The nestQtree approach always performs the best, followed by nestRtree, sfcurve, pgbjk, and spitfire. Notice that block-based approaches induce extensive amounts of kNN candidates for query points in the same block, and it directly degrades the 1 2 3 4 5 6 7 8 9 10 x 104 103 104 105… view at source ↗
Figure 6
Figure 6. Figure 6: sFilter structure (up left), the related data (up [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The performance of spatial range join 0 2 4 6 8 10 12 14 16 18 20 102 103 104 Data Size of Inner Table (Millon) Query Time(seconds) log LocationSpark LocationSpark(opt) Simba Simba(opt) (a) Twitter 0 10 20 30 40 50 60 70 80 90 100 102 103 104 Data Size of Inner Table (Millon) Query Time(seconds) log LocationSpark LocationSpark(opt) Simba Simba(opt) (b) OSMP [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Performance of kNN join by increasing the num￾ber of data points achieve 5 times speedup against with Simba, since sFilter can reduce redundant data partitions searching. This is also observed from the Simba(opt) (i.e., with sFilter) can achieve comparable performance with LOCATION￾SPARK. Performance results (mainly, the execution times of the spatial range join) are listed in [PITH_FULL_IMAGE:figures/ful… view at source ↗
Figure 9
Figure 9. Figure 9: Performance of spatial range join on various query [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: The effect of the sFilter on reducing the shuffle [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Performance of spatial range join and kNN join when varying the number of executors overviews. Gaede and Gunther [10] provide a good ¨ summary of spatial data indexing. Sowell et al. give a survey and experimental study of iterative spatial-join in memory [20]. Recently, there has been considerable interest in supporting spatial data management over Hadoop MapReduce. Hadoop-GIS [3] supports spatial querie… view at source ↗
read the original abstract

Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques for handling query skew, which is common in practice, and optimize communication costs accordingly. We propose a distributed query scheduler that use a new cost model to optimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries in that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system. The experimental study is based on real datasets and demonstrates that distributed spatial query processing can be enhanced by up to an order of magnitude over existing in-memory and distributed spatial systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents LocationSpark, a Spark-based prototype for in-memory distributed spatial query processing. It introduces a distributed query scheduler that employs a new cost model to generate execution plans minimizing query skew effects, bitmap-filter-based spatial indexing to forward queries to appropriate local nodes, and per-node local optimization of query plans. Experiments on real datasets are reported to demonstrate up to an order-of-magnitude improvement over existing in-memory and distributed spatial systems.

Significance. If the performance claims hold under rigorous validation, the work would be significant for practical distributed spatial analytics by directly targeting query skew, a frequent real-world bottleneck, and by providing a Spark-integrated implementation that could facilitate adoption and further extensions in scalable spatial data systems.

major comments (2)
  1. [Abstract and experimental study] Abstract and § on experimental study: the central claim of 'up to an order of magnitude' improvement supplies no baselines, dataset sizes or characteristics, error bars, or query types, preventing evaluation of whether the cost model and bitmap filters deliver net gains or are offset by their own overheads.
  2. [Distributed query scheduler] Description of the distributed query scheduler: no equations, pseudocode, or formal definition is given for the new cost model that is asserted to optimize communication and local computation costs while mitigating skew; without this it is impossible to determine whether the model introduces incorrect routing or extra data movement as raised by the stress-test concern.
minor comments (1)
  1. The abstract could more explicitly list the supported spatial query types (e.g., range, kNN) and the exact Spark version or extensions used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to improve the clarity of our performance claims and the formal presentation of the cost model. We address each point below and commit to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract and experimental study] Abstract and § on experimental study: the central claim of 'up to an order of magnitude' improvement supplies no baselines, dataset sizes or characteristics, error bars, or query types, preventing evaluation of whether the cost model and bitmap filters deliver net gains or are offset by their own overheads.

    Authors: We agree that the abstract and experimental study section would benefit from additional specifics to allow readers to fully evaluate the claims. In the revised manuscript, we will update the abstract to reference the real-world datasets (including sizes and characteristics), the query types evaluated, and the baselines compared against. The experimental study section will be expanded to explicitly report dataset details, include error bars on performance results, specify query workloads, and provide direct comparisons demonstrating that reported gains exceed any overhead from the cost model and bitmap filters. revision: yes

  2. Referee: [Distributed query scheduler] Description of the distributed query scheduler: no equations, pseudocode, or formal definition is given for the new cost model that is asserted to optimize communication and local computation costs while mitigating skew; without this it is impossible to determine whether the model introduces incorrect routing or extra data movement as raised by the stress-test concern.

    Authors: We acknowledge that a more formal and precise presentation of the cost model is needed. In the revision, we will add the mathematical equations defining the cost model (capturing communication, local computation, and skew mitigation terms), along with pseudocode for the distributed query scheduler. This will enable assessment of routing decisions and data movement. We will also expand the discussion to address potential stress-test scenarios, including any analysis of incorrect routing or overhead. revision: yes

Circularity Check

0 steps flagged

No circularity; system description with external experimental validation

full rationale

The paper presents a distributed spatial query system prototyped in Spark, introducing a scheduler with a cost model and bitmap filters for skew handling. No equations, fitted parameters, or self-citations are invoked as load-bearing derivations. Claims rest on experimental results using real datasets compared against existing systems, which are independent benchmarks. This matches the default case of a self-contained engineering contribution with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract states no free parameters, mathematical axioms, or new invented entities; the contribution consists of engineering techniques and a prototype implementation.

pith-pipeline@v0.9.0 · 5752 in / 1026 out tokens · 27527 ms · 2026-05-25T00:42:14.165767+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. [1]

    https://github.com/harsha2010/magellan

    Magellan. https://github.com/harsha2010/magellan

  2. [2]

    http://simin.me/projects/spatialspark/

    Spatialspark. http://simin.me/projects/spatialspark/

  3. [3]

    A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop gis: A high performance spatial data warehousing system over mapreduce. VLDB, 2013

  4. [4]

    Alexiou, D

    K. Alexiou, D. Kossmann, and P .-A. Larson. Adaptive range filters for cold data: Avoiding trips to siberia. In VLDB, 2013

  5. [5]

    A. M. Aly, A. R. Mahmood, M. S. Hassan, W. G. Aref, M. Ouzzani, H. Elmeleegy, and T. Qadah. AQWA: adaptive query-workload- aware partitioning of big spatial data. In VLDB, 2015

  6. [6]

    Brinkhoff, H.-P

    T. Brinkhoff, H.-P . Kriegel, and B. Seeger. Efficient processing of spatial joins using r-trees. SIGMOD Rec., 1993

  7. [7]

    Calderoni, P

    L. Calderoni, P . Palmieri, and D. Maio. Location privacy without mutual trust. Comput. Commun

  8. [8]

    Chatzimilioudis, C

    G. Chatzimilioudis, C. Costa, D. Zeinalipour-Yazti, W. Lee, and E. Pitoura. Distributed in-memory processing of all k nearest neighbor queries. TKDE, 2016

  9. [9]

    Eldawy and M

    A. Eldawy and M. Mokbel. Spatialhadoop: A mapreduce frame- work for spatial data. In ICDE, 2015

  10. [10]

    Gaede and O

    V . Gaede and O. G ¨unther. Multidimensional access methods. ACM Comput. Surv. , 30:170–231, 1998

  11. [11]

    M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness . W. H. Freeman & Co., New York, NY, USA, 1979

  12. [12]

    J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. Graphx: Graph processing in a distributed dataflow framework. In OSDI, 2014

  13. [13]

    Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In SoCC, 2010

  14. [14]

    Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skewtune: Mitigating skew in mapreduce applications. In SIGMOD, 2012

  15. [15]

    W. Lu, Y. Shen, S. Chen, and B. C. Ooi. Efficient processing of k nearest neighbor joins using mapreduce. Proc. VLDB Endow. , 2012

  16. [16]

    Mingjie, Y

    T. Mingjie, Y. Yongyang, G. A. Walid, M. M. Qutaibah, O. Mourad, and R. Ahmed. Locationspark: A distributed in-memory data management system for big spatial data. Purdue technical report , 2016

  17. [17]

    Nishimura, S

    S. Nishimura, S. Das, D. Agrawal, and A. Abbadi. Md-hbase: A scalable multi-dimensional data infrastructure for location aware services. In MDM 11

  18. [18]

    H. Samet. Foundations of Multidimensional and Metric Data Struc- tures. Morgan Kaufmann Publishers Inc., 2005

  19. [19]

    Shvachko, H

    K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In MSST, 2010

  20. [20]

    Sowell, M

    B. Sowell, M. V . Salles, T. Cao, A. Demers, and J. Gehrke. An experimental analysis of iterated spatial joins in main memory. Proc. VLDB Endow. , 2013

  21. [21]

    M. Tang, Y. Yu, W. G. Aref, Q. Malluhi, and M. Ouzzani. Loca- tionspark: A distributed in-memory data management system for big spatial data. VLDB, 2016

  22. [22]

    J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 1985

  23. [23]

    C. Xia, H. Lu, B. Chin, and O. J. Hu. Gorder: An efficient method for knn join processing. In VLDB, 2004

  24. [24]

    D. Xie, F. Li, B. Yao, G. Li, L. Zhou, and M. Guo. Simba: Efficient in-memory spatial analytics. In SIGMOD, 2016

  25. [25]

    J. Yu, J. Wu, and M. Sarwat. Geospark: A cluster computing framework for processing large-scale spatial data. In SIGSP ATIAL, 2015

  26. [26]

    M. Zaharia. An Architecture for Fast and General Data Processing on Large Clusters . Association for Computing Machinery and Morgan, 2016. PLACE PHOTO HERE Mingjie Tang is member of technical stuff at Hortonworks. He won his PhD at Purdue Uni- versity. His research interests include database system and big data infrastructure. He has an MS in computer sci...