LocationSpark: In-memory Distributed Spatial Query Processing and Optimization

Ahmed R. Mahmood; Mingjie Tang; Mourad Ouzzani; Qutaibah M. Malluhi; Walid G. Aref; Yongyang Yu

arxiv: 1907.03736 · v2 · pith:2ZKENKIEnew · submitted 2019-07-08 · 💻 cs.DB

LocationSpark: In-memory Distributed Spatial Query Processing and Optimization

Mingjie Tang , Yongyang Yu , Walid G. Aref , Ahmed R. Mahmood , Qutaibah M. Malluhi , Mourad Ouzzani This is my paper

Pith reviewed 2026-05-25 00:42 UTC · model grok-4.3

classification 💻 cs.DB

keywords spatial query processingdistributed systemsquery optimizationbitmap filtersquery skewin-memory computingSparkspatial indexing

0 comments

The pith

A distributed query scheduler with a new cost model and bitmap filters lets spatial queries run up to an order of magnitude faster inside Spark by reducing skew effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LocationSpark as an extension inside Spark that adds a scheduler to handle query skew common in spatial workloads. The scheduler relies on a cost model to build execution plans and on bitmap filters to route each query to the right nodes, after which each node runs its own local optimization. This combination is meant to cut both skew-induced imbalance and communication volume. If the approach works, large spatial datasets from mapping or location services could be queried at scale without the slowdowns seen in prior distributed engines. The authors prototype the full set of techniques and test them on real data to show the claimed gains.

Core claim

LocationSpark shows that a distributed query scheduler using a new cost model to generate skew-minimizing plans, together with bitmap-filter spatial indexes that forward queries to the correct local nodes, enables each node to select its own best execution plan; when implemented in Spark this produces up to an order of magnitude faster spatial query processing than existing in-memory and distributed spatial systems.

What carries the argument

The distributed query scheduler that applies a new cost model to plan executions and bitmap filters to forward queries, plus per-node local optimization based on local indexes.

If this is right

Query skew effects are reduced by the scheduler's cost-model plans.
Communication costs drop because bitmap filters limit unnecessary data movement.
Each local node independently chooses the best plan for its data and query type.
The full set of techniques runs inside the Spark execution engine.
Real-dataset experiments confirm up to 10x gains over prior distributed spatial engines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scheduler-plus-filter pattern could be ported to other in-memory engines that already support spatial indexes.
Dynamic adjustment of the cost model might handle streaming spatial data where skew changes over time.
The routing mechanism could be combined with existing spatial partitioning methods to further lower cross-node traffic.
Similar bitmap-filter forwarding might help non-spatial workloads that also suffer from data skew, such as graph traversals.

Load-bearing premise

The new cost model and bitmap-filter forwarding correctly identify and mitigate query skew without introducing offsetting overheads or incorrect routing decisions.

What would settle it

Run identical high-skew spatial range and join queries on the same real datasets using both LocationSpark and an existing system such as GeoSpark, then measure whether the observed runtime ratio stays near 10x or falls below 2x once scheduler and filter overhead is included.

Figures

Figures reproduced from arXiv: 1907.03736 by Ahmed R. Mahmood, Mingjie Tang, Mourad Ouzzani, Qutaibah M. Malluhi, Walid G. Aref, Yongyang Yu.

**Figure 2.** Figure 2: LOCATIONSPARK system architecture 2.2 Overview of In-memory Distributed Spatial Query Processing To facilitate spatial query processing, we build a distributed spatial index for in-memory spatial data. Given a spatial dataset D, we obtain samples from D and construct a spatial index (e.g., an R-tree) with N leaves over the samples. We refer to this index on the sample data as the global spatial index. Ne… view at source ↗

**Figure 3.** Figure 3: Execution plan for spatial range join. The red lines [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Evaluation of local spatial join algorithms [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: a gives the performance of the specialized kNN join approaches within a local computation node when varying k from 10 to 150. The nestQtree approach always performs the best, followed by nestRtree, sfcurve, pgbjk, and spitfire. Notice that block-based approaches induce extensive amounts of kNN candidates for query points in the same block, and it directly degrades the 1 2 3 4 5 6 7 8 9 10 x 104 103 104 105… view at source ↗

**Figure 6.** Figure 6: sFilter structure (up left), the related data (up [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: The performance of spatial range join 0 2 4 6 8 10 12 14 16 18 20 102 103 104 Data Size of Inner Table (Millon) Query Time(seconds) log LocationSpark LocationSpark(opt) Simba Simba(opt) (a) Twitter 0 10 20 30 40 50 60 70 80 90 100 102 103 104 Data Size of Inner Table (Millon) Query Time(seconds) log LocationSpark LocationSpark(opt) Simba Simba(opt) (b) OSMP [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 8.** Figure 8: Performance of kNN join by increasing the number of data points achieve 5 times speedup against with Simba, since sFilter can reduce redundant data partitions searching. This is also observed from the Simba(opt) (i.e., with sFilter) can achieve comparable performance with LOCATIONSPARK. Performance results (mainly, the execution times of the spatial range join) are listed in [PITH_FULL_IMAGE:figures/ful… view at source ↗

**Figure 9.** Figure 9: Performance of spatial range join on various query [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 10.** Figure 10: The effect of the sFilter on reducing the shuffle [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

**Figure 11.** Figure 11: Performance of spatial range join and kNN join when varying the number of executors overviews. Gaede and Gunther [10] provide a good ¨ summary of spatial data indexing. Sowell et al. give a survey and experimental study of iterative spatial-join in memory [20]. Recently, there has been considerable interest in supporting spatial data management over Hadoop MapReduce. Hadoop-GIS [3] supports spatial querie… view at source ↗

read the original abstract

Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques for handling query skew, which is common in practice, and optimize communication costs accordingly. We propose a distributed query scheduler that use a new cost model to optimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries in that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system. The experimental study is based on real datasets and demonstrates that distributed spatial query processing can be enhanced by up to an order of magnitude over existing in-memory and distributed spatial systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LocationSpark adds a scheduler with cost model plus bitmap routing to Spark for spatial skew, but the abstract gives almost no experimental details so the order-of-magnitude claim is hard to judge.

read the letter

The paper's main contribution is a set of techniques inside Spark for spatial queries that suffer from skew: a distributed scheduler that uses a new cost model to build execution plans, bitmap filters to route queries to the right nodes, and local optimization at each node. These are presented as ways to cut communication and balance load for location data workloads. The abstract says the system was built in Spark and tested on real datasets, with gains up to an order of magnitude over prior in-memory and distributed spatial systems. That is the concrete thing the work offers: a practical integration of these pieces rather than a wholly new paradigm. The local-node optimization step and the focus on query skew in a Spark setting are reasonable extensions of earlier distributed query work. The prototyping choice is sensible for a systems paper and the mention of real datasets is a small positive. The soft spots are more noticeable. The abstract states the big performance claim but supplies no baselines, no dataset sizes or characteristics, no error bars, and no description of how the cost model is computed or how the bitmaps are constructed and maintained. Without those, it is impossible to tell whether the scheduler and filters actually reduce skew or simply trade one cost for another. The stress-test note is on target here: if the cost model undercounts communication or the bitmaps produce bad routing, the net benefit could vanish, and the abstract gives no evidence either way. This is a system-description paper aimed at people who build or tune distributed spatial engines on Spark. A reader already working in that area might pick up useful implementation ideas from the architecture, but anyone wanting to rely on the performance numbers would need the full experimental section. The work is coherent on its own terms and engages the literature at the level of a systems paper, so it clears the bar for peer review even though the current evidence is thin. I would send it out rather than desk-reject, with the expectation that referees will press hard on the experiments and the cost-model validation.

Referee Report

2 major / 1 minor

Summary. The manuscript presents LocationSpark, a Spark-based prototype for in-memory distributed spatial query processing. It introduces a distributed query scheduler that employs a new cost model to generate execution plans minimizing query skew effects, bitmap-filter-based spatial indexing to forward queries to appropriate local nodes, and per-node local optimization of query plans. Experiments on real datasets are reported to demonstrate up to an order-of-magnitude improvement over existing in-memory and distributed spatial systems.

Significance. If the performance claims hold under rigorous validation, the work would be significant for practical distributed spatial analytics by directly targeting query skew, a frequent real-world bottleneck, and by providing a Spark-integrated implementation that could facilitate adoption and further extensions in scalable spatial data systems.

major comments (2)

[Abstract and experimental study] Abstract and § on experimental study: the central claim of 'up to an order of magnitude' improvement supplies no baselines, dataset sizes or characteristics, error bars, or query types, preventing evaluation of whether the cost model and bitmap filters deliver net gains or are offset by their own overheads.
[Distributed query scheduler] Description of the distributed query scheduler: no equations, pseudocode, or formal definition is given for the new cost model that is asserted to optimize communication and local computation costs while mitigating skew; without this it is impossible to determine whether the model introduces incorrect routing or extra data movement as raised by the stress-test concern.

minor comments (1)

The abstract could more explicitly list the supported spatial query types (e.g., range, kNN) and the exact Spark version or extensions used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to improve the clarity of our performance claims and the formal presentation of the cost model. We address each point below and commit to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses

Referee: [Abstract and experimental study] Abstract and § on experimental study: the central claim of 'up to an order of magnitude' improvement supplies no baselines, dataset sizes or characteristics, error bars, or query types, preventing evaluation of whether the cost model and bitmap filters deliver net gains or are offset by their own overheads.

Authors: We agree that the abstract and experimental study section would benefit from additional specifics to allow readers to fully evaluate the claims. In the revised manuscript, we will update the abstract to reference the real-world datasets (including sizes and characteristics), the query types evaluated, and the baselines compared against. The experimental study section will be expanded to explicitly report dataset details, include error bars on performance results, specify query workloads, and provide direct comparisons demonstrating that reported gains exceed any overhead from the cost model and bitmap filters. revision: yes
Referee: [Distributed query scheduler] Description of the distributed query scheduler: no equations, pseudocode, or formal definition is given for the new cost model that is asserted to optimize communication and local computation costs while mitigating skew; without this it is impossible to determine whether the model introduces incorrect routing or extra data movement as raised by the stress-test concern.

Authors: We acknowledge that a more formal and precise presentation of the cost model is needed. In the revision, we will add the mathematical equations defining the cost model (capturing communication, local computation, and skew mitigation terms), along with pseudocode for the distributed query scheduler. This will enable assessment of routing decisions and data movement. We will also expand the discussion to address potential stress-test scenarios, including any analysis of incorrect routing or overhead. revision: yes

Circularity Check

0 steps flagged

No circularity; system description with external experimental validation

full rationale

The paper presents a distributed spatial query system prototyped in Spark, introducing a scheduler with a cost model and bitmap filters for skew handling. No equations, fitted parameters, or self-citations are invoked as load-bearing derivations. Claims rest on experimental results using real datasets compared against existing systems, which are independent benchmarks. This matches the default case of a self-contained engineering contribution with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract states no free parameters, mathematical axioms, or new invented entities; the contribution consists of engineering techniques and a prototype implementation.

pith-pipeline@v0.9.0 · 5752 in / 1026 out tokens · 27527 ms · 2026-05-25T00:42:14.165767+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1]

https://github.com/harsha2010/magellan

Magellan. https://github.com/harsha2010/magellan

work page
[2]

http://simin.me/projects/spatialspark/

Spatialspark. http://simin.me/projects/spatialspark/

work page
[3]

A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop gis: A high performance spatial data warehousing system over mapreduce. VLDB, 2013

work page 2013
[4]

Alexiou, D

K. Alexiou, D. Kossmann, and P .-A. Larson. Adaptive range ﬁlters for cold data: Avoiding trips to siberia. In VLDB, 2013

work page 2013
[5]

A. M. Aly, A. R. Mahmood, M. S. Hassan, W. G. Aref, M. Ouzzani, H. Elmeleegy, and T. Qadah. AQWA: adaptive query-workload- aware partitioning of big spatial data. In VLDB, 2015

work page 2015
[6]

Brinkhoff, H.-P

T. Brinkhoff, H.-P . Kriegel, and B. Seeger. Efﬁcient processing of spatial joins using r-trees. SIGMOD Rec., 1993

work page 1993
[7]

Calderoni, P

L. Calderoni, P . Palmieri, and D. Maio. Location privacy without mutual trust. Comput. Commun

work page
[8]

Chatzimilioudis, C

G. Chatzimilioudis, C. Costa, D. Zeinalipour-Yazti, W. Lee, and E. Pitoura. Distributed in-memory processing of all k nearest neighbor queries. TKDE, 2016

work page 2016
[9]

Eldawy and M

A. Eldawy and M. Mokbel. Spatialhadoop: A mapreduce frame- work for spatial data. In ICDE, 2015

work page 2015
[10]

Gaede and O

V . Gaede and O. G ¨unther. Multidimensional access methods. ACM Comput. Surv. , 30:170–231, 1998

work page 1998
[11]

M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness . W. H. Freeman & Co., New York, NY, USA, 1979

work page 1979
[12]

J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. Graphx: Graph processing in a distributed dataﬂow framework. In OSDI, 2014

work page 2014
[13]

Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel processing of feature-extracting scientiﬁc user-deﬁned functions. In SoCC, 2010

work page 2010
[14]

Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skewtune: Mitigating skew in mapreduce applications. In SIGMOD, 2012

work page 2012
[15]

W. Lu, Y. Shen, S. Chen, and B. C. Ooi. Efﬁcient processing of k nearest neighbor joins using mapreduce. Proc. VLDB Endow. , 2012

work page 2012
[16]

Mingjie, Y

T. Mingjie, Y. Yongyang, G. A. Walid, M. M. Qutaibah, O. Mourad, and R. Ahmed. Locationspark: A distributed in-memory data management system for big spatial data. Purdue technical report , 2016

work page 2016
[17]

Nishimura, S

S. Nishimura, S. Das, D. Agrawal, and A. Abbadi. Md-hbase: A scalable multi-dimensional data infrastructure for location aware services. In MDM 11

work page
[18]

H. Samet. Foundations of Multidimensional and Metric Data Struc- tures. Morgan Kaufmann Publishers Inc., 2005

work page 2005
[19]

Shvachko, H

K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed ﬁle system. In MSST, 2010

work page 2010
[20]

Sowell, M

B. Sowell, M. V . Salles, T. Cao, A. Demers, and J. Gehrke. An experimental analysis of iterated spatial joins in main memory. Proc. VLDB Endow. , 2013

work page 2013
[21]

M. Tang, Y. Yu, W. G. Aref, Q. Malluhi, and M. Ouzzani. Loca- tionspark: A distributed in-memory data management system for big spatial data. VLDB, 2016

work page 2016
[22]

J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 1985

work page 1985
[23]

C. Xia, H. Lu, B. Chin, and O. J. Hu. Gorder: An efﬁcient method for knn join processing. In VLDB, 2004

work page 2004
[24]

D. Xie, F. Li, B. Yao, G. Li, L. Zhou, and M. Guo. Simba: Efﬁcient in-memory spatial analytics. In SIGMOD, 2016

work page 2016
[25]

J. Yu, J. Wu, and M. Sarwat. Geospark: A cluster computing framework for processing large-scale spatial data. In SIGSP ATIAL, 2015

work page 2015
[26]

M. Zaharia. An Architecture for Fast and General Data Processing on Large Clusters . Association for Computing Machinery and Morgan, 2016. PLACE PHOTO HERE Mingjie Tang is member of technical stuff at Hortonworks. He won his PhD at Purdue Uni- versity. His research interests include database system and big data infrastructure. He has an MS in computer sci...

work page 2016

[1] [1]

https://github.com/harsha2010/magellan

Magellan. https://github.com/harsha2010/magellan

work page

[2] [2]

http://simin.me/projects/spatialspark/

Spatialspark. http://simin.me/projects/spatialspark/

work page

[3] [3]

A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop gis: A high performance spatial data warehousing system over mapreduce. VLDB, 2013

work page 2013

[4] [4]

Alexiou, D

K. Alexiou, D. Kossmann, and P .-A. Larson. Adaptive range ﬁlters for cold data: Avoiding trips to siberia. In VLDB, 2013

work page 2013

[5] [5]

A. M. Aly, A. R. Mahmood, M. S. Hassan, W. G. Aref, M. Ouzzani, H. Elmeleegy, and T. Qadah. AQWA: adaptive query-workload- aware partitioning of big spatial data. In VLDB, 2015

work page 2015

[6] [6]

Brinkhoff, H.-P

T. Brinkhoff, H.-P . Kriegel, and B. Seeger. Efﬁcient processing of spatial joins using r-trees. SIGMOD Rec., 1993

work page 1993

[7] [7]

Calderoni, P

L. Calderoni, P . Palmieri, and D. Maio. Location privacy without mutual trust. Comput. Commun

work page

[8] [8]

Chatzimilioudis, C

G. Chatzimilioudis, C. Costa, D. Zeinalipour-Yazti, W. Lee, and E. Pitoura. Distributed in-memory processing of all k nearest neighbor queries. TKDE, 2016

work page 2016

[9] [9]

Eldawy and M

A. Eldawy and M. Mokbel. Spatialhadoop: A mapreduce frame- work for spatial data. In ICDE, 2015

work page 2015

[10] [10]

Gaede and O

V . Gaede and O. G ¨unther. Multidimensional access methods. ACM Comput. Surv. , 30:170–231, 1998

work page 1998

[11] [11]

M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness . W. H. Freeman & Co., New York, NY, USA, 1979

work page 1979

[12] [12]

J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. Graphx: Graph processing in a distributed dataﬂow framework. In OSDI, 2014

work page 2014

[13] [13]

Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel processing of feature-extracting scientiﬁc user-deﬁned functions. In SoCC, 2010

work page 2010

[14] [14]

Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skewtune: Mitigating skew in mapreduce applications. In SIGMOD, 2012

work page 2012

[15] [15]

W. Lu, Y. Shen, S. Chen, and B. C. Ooi. Efﬁcient processing of k nearest neighbor joins using mapreduce. Proc. VLDB Endow. , 2012

work page 2012

[16] [16]

Mingjie, Y

T. Mingjie, Y. Yongyang, G. A. Walid, M. M. Qutaibah, O. Mourad, and R. Ahmed. Locationspark: A distributed in-memory data management system for big spatial data. Purdue technical report , 2016

work page 2016

[17] [17]

Nishimura, S

S. Nishimura, S. Das, D. Agrawal, and A. Abbadi. Md-hbase: A scalable multi-dimensional data infrastructure for location aware services. In MDM 11

work page

[18] [18]

H. Samet. Foundations of Multidimensional and Metric Data Struc- tures. Morgan Kaufmann Publishers Inc., 2005

work page 2005

[19] [19]

Shvachko, H

K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed ﬁle system. In MSST, 2010

work page 2010

[20] [20]

Sowell, M

B. Sowell, M. V . Salles, T. Cao, A. Demers, and J. Gehrke. An experimental analysis of iterated spatial joins in main memory. Proc. VLDB Endow. , 2013

work page 2013

[21] [21]

M. Tang, Y. Yu, W. G. Aref, Q. Malluhi, and M. Ouzzani. Loca- tionspark: A distributed in-memory data management system for big spatial data. VLDB, 2016

work page 2016

[22] [22]

J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 1985

work page 1985

[23] [23]

C. Xia, H. Lu, B. Chin, and O. J. Hu. Gorder: An efﬁcient method for knn join processing. In VLDB, 2004

work page 2004

[24] [24]

D. Xie, F. Li, B. Yao, G. Li, L. Zhou, and M. Guo. Simba: Efﬁcient in-memory spatial analytics. In SIGMOD, 2016

work page 2016

[25] [25]

J. Yu, J. Wu, and M. Sarwat. Geospark: A cluster computing framework for processing large-scale spatial data. In SIGSP ATIAL, 2015

work page 2015

[26] [26]

M. Zaharia. An Architecture for Fast and General Data Processing on Large Clusters . Association for Computing Machinery and Morgan, 2016. PLACE PHOTO HERE Mingjie Tang is member of technical stuff at Hortonworks. He won his PhD at Purdue Uni- versity. His research interests include database system and big data infrastructure. He has an MS in computer sci...

work page 2016